knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(tidyverse) # general
library(ggalt) # dumbbell plots
library(grid) # plots
library(gridExtra) # plots
library(ggcorrplot) 
library(ggplot2)
library(rworldmap) # quick country-level heat map
library(countrycode) # continent
library(broom) # significant trends within countries
library(lubridate) # dealing with dates
library(car)
library(scales)

1.Obtaining Data

1.1 Suicide Data

The dataset was sourced from links and contains a comprehensive collection of suicide statistics broken down by country, year, age, and sex. This data has been gathered from multiple datasets with the intention of identifying trends and patterns in global suicide rates. Here are the features of this dataset:

  • Country: The country where the data was recorded.

  • Year: The year when the data was recorded. The data was collect from 1985 until 2016.

  • Sex: The sex of the individuals included in the suicide count.

  • Age: The age group of the individuals included in the suicide count.

  • Suicide_no: The total number of suicides recorded in a specific country, year, age group, and sex.

  • Population: The total population of the specific age and sex group in the country for that year.

  • Suicide/ 100k pop: This is a derived metric representing the number of suicides per 100,000 people in the population, calculated as (Suicide_no / Population) * 100000.

  • HDI_for_year: Human Development Index, a statistic composite index of life expectancy, education, and per capita income indicators.

  • GDP_for_year ($): The gross domestic product (GDP) of the country for the specified year, measured in US dollars.

  • GDP_per_capita ($): The GDP per capita of the country for the specified year, also measured in US dollars. It is calculated by dividing the GDP_for_year ($) by the total population of the country.

  • Generation: The generation cohort of the individuals included in the suicide count (e.g., Gen X, Boomers, etc.).

suicide_data <- read_csv("suicide_data.csv", show_col_types = FALSE)
head(suicide_data)
dim(suicide_data)
[1] 27820    12

We further supplemented our analysis by incorporating additional features, which we hypothesize may significantly impact suicide rates.

1.2 Continet

We’ve enriched our dataset by appending a ‘continent’ feature corresponding to each country. This enhancement, accomplished utilizing the ‘countrycode’ library, will facilitate subsequent geographical analyses.

# getting continent data:
suicide_data$continent <- countrycode(sourcevar = suicide_data$country,
                              origin = "country.name",
                              destination = "continent")

dim(suicide_data)
[1] 27820    13

1.3 Life Expectancy

We also added ‘life expectancy’ data from links. Remember, the Human Development Index (HDI) also measures life expectancy, so these two features might be highly correlated. We’ll keep this in mind for our analysis.

life_exp_data <-read_csv('life_exp.csv',show_col_types = FALSE)
head(life_exp_data)
life_exp_data$Time <- as.integer(life_exp_data$Time)

life_exp_data <- life_exp_data %>%
  rename( life_exp = `Value`,
          year = `Time`,
          country = `Country Name`) %>%
  as.data.frame()

We’ve identified that the format of some country names differs between the suicide data and the life expectancy dataset. To prevent any further discrepancies, it’s imperative that we address and rectify these inconsistencies.

name_changes_life <- c("Bahamas, The" = "Bahamas", 
                  "Czechia" = "Czech Republic",
                  "Kyrgyz Republic" = "Kyrgyzstan",
                  "Macao SAR, China" = "Macau",
                  "Korea, Rep." = "Republic of Korea",
                  "St. Kitts and Nevis" = "Saint Kitts and Nevis",
                  "St. Lucia" = "Saint Lucia",
                  "St. Vincent and the Grenadines" = "Saint Vincent and Grenadines",
                  "Slovak Republic" = "Slovakia",
                  "Turkiye" = "Turkey"
                  )

life_exp_data$country[life_exp_data$country %in% names(name_changes_life)] <- 
    name_changes_life[life_exp_data$country[life_exp_data$country %in% names(name_changes_life)]]
data <- suicide_data %>%
  left_join(life_exp_data[, c('year', 'country', 'life_exp')], 
            by = c('year', 'country'))
dim(data)
[1] 27820    14

1.4 Temperature

Our investigation suggests that temperature might influence suicide rates.(source: links) So, we incorporated temperature data into our dataset from link to further explore this possibility. This dataset covers temperature data since 1750 till 2013. We include three additional features:

  • max_temp: The highest monthly temperature for each year.
  • min_temp: The lowest monthly temperature for each year.
  • avg_temp: The average monthly temperature for each year.
temp_data <- read_csv("GlobalLandTemperaturesByCountry.csv",show_col_types = FALSE)
head(temp_data)
# Extract year and month from date_column
temp_data$year <- year(temp_data$dt)
temp_data$month <- month(temp_data$dt)

temp_data <- select(temp_data, -dt, -"AverageTemperatureUncertainty")
temp_data<- rename(temp_data, country = Country)

temp_data$year <-as.integer(temp_data$year)
temp_data$month <-as.integer(temp_data$month)
temp_data$AverageTemperature <-as.numeric(temp_data$AverageTemperature)

temp_data <- temp_data %>%
  filter(year >= 1985 & year <= 2016)

temp_data <- temp_data %>%
  group_by(country, year) %>%
  mutate(
    avg_temp = sum(AverageTemperature, na.rm = TRUE) / sum(!is.na(AverageTemperature)),
    max_temp = max(AverageTemperature, na.rm = TRUE),
    min_temp = min(AverageTemperature, na.rm = TRUE)
  ) %>%
  ungroup()
Warning: There were 58 warnings in `mutate()`.
The first warning was:
ℹ In argument: `max_temp = max(AverageTemperature, na.rm = TRUE)`.
ℹ In group 233: `country = "Antarctica"`, `year = 1985`.
Caused by warning in `max()`:
! no non-missing arguments to max; returning -Inf
ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warnings()]8;; to see the 57 remaining warnings.
temp_data <- select(temp_data, -month, -"AverageTemperature")

temp_data <-distinct(temp_data)

temp_data <- temp_data %>%
  filter_all(all_vars(!is.infinite(.)))

As with the life expectancy data, we’ve noticed discrepancies in the formatting of country names between the suicide data and temperature dataset. To avoid any potential issues down the line, it’s crucial that we reconcile these inconsistencies and standardize the country names across all datasets.

name_changes_temp <- c("Antigua And Barbuda" = "Antigua and Barbuda", 
                  "Bosnia And Herzegovina" = "Bosnia and Herzegovina",
                  "South Korea" = "Republic of Korea",
                  "Russia" = "Russian Federation",
                  "Saint Kitts And Nevis" = "Saint Kitts and Nevis",
                  "Trinidad And Tobago" = "Trinidad and Tobago",
                  "Saint Vincent And The Grenadines" = "Saint Vincent and Grenadines"
                  )
temp_data$country[temp_data$country %in% names(name_changes_temp)] <- 
    name_changes_temp[temp_data$country[temp_data$country %in% names(name_changes_temp)]]
data <- data %>%
  left_join(temp_data[, c('year', 'country', 'avg_temp', 'max_temp', 'min_temp')], 
            by = c('year', 'country'))
dim(data)
[1] 27820    17

Now that we’ve collected all the necessary data, our next step is to clean and preprocess this data for further analysis.

2.Clean and Filter Data

glimpse(head(data, 8))
Rows: 8
Columns: 17
$ country              <chr> "Albania", "Albania", "Albania", "Albania", "Albania", "Alban…
$ year                 <dbl> 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1987
$ sex                  <chr> "male", "male", "female", "male", "male", "female", "female",…
$ age                  <chr> "15-24 years", "35-54 years", "15-24 years", "75+ years", "25…
$ suicides_no          <dbl> 21, 16, 14, 1, 9, 1, 6, 4
$ population           <dbl> 312900, 308000, 289700, 21800, 274300, 35600, 278800, 257200
$ `suicides/100k pop`  <dbl> 6.71, 5.19, 4.83, 4.59, 3.28, 2.81, 2.15, 1.56
$ `country-year`       <chr> "Albania1987", "Albania1987", "Albania1987", "Albania1987", "…
$ `HDI for year`       <dbl> NA, NA, NA, NA, NA, NA, NA, NA
$ `gdp_for_year ($)`   <dbl> 2156624900, 2156624900, 2156624900, 2156624900, 2156624900, 2…
$ `gdp_per_capita ($)` <dbl> 796, 796, 796, 796, 796, 796, 796, 796
$ generation           <chr> "Generation X", "Silent", "Generation X", "G.I. Generation", …
$ continent            <chr> "Europe", "Europe", "Europe", "Europe", "Europe", "Europe", "…
$ life_exp             <dbl> 72.352, 72.352, 72.352, 72.352, 72.352, 72.352, 72.352, 72.352
$ avg_temp             <dbl> 12.99658, 12.99658, 12.99658, 12.99658, 12.99658, 12.99658, 1…
$ max_temp             <dbl> 24.084, 24.084, 24.084, 24.084, 24.084, 24.084, 24.084, 24.084
$ min_temp             <dbl> 3.304, 3.304, 3.304, 3.304, 3.304, 3.304, 3.304, 3.304
print(colnames(data))
 [1] "country"            "year"               "sex"                "age"               
 [5] "suicides_no"        "population"         "suicides/100k pop"  "country-year"      
 [9] "HDI for year"       "gdp_for_year ($)"   "gdp_per_capita ($)" "generation"        
[13] "continent"          "life_exp"           "avg_temp"           "max_temp"          
[17] "min_temp"          
sapply(data, function(x) length(unique(x)))
           country               year                sex                age 
               101                 32                  2                  6 
       suicides_no         population  suicides/100k pop       country-year 
              2084              25564               5298               2321 
      HDI for year   gdp_for_year ($) gdp_per_capita ($)         generation 
               306               2321               2233                  6 
         continent           life_exp           avg_temp           max_temp 
                 5               2177               2146               2011 
          min_temp 
              2082 

2.1 Columns and Values

data <- data %>% 
  rename(suicide_ratio = `suicides/100k pop`, 
         country_year = `country-year`,
         HDI_for_year = `HDI for year`,
         GDP_for_year = `gdp_for_year ($)`, 
         GDP_per_capita = `gdp_per_capita ($)`) %>%
  as.data.frame()

data$age <- gsub(" years", "", data$age)

data$sex <- ifelse(data$sex == "male", "Male", "Female")

2.2 Missing Values

2.2.1 Data Scarcity by Country/Year

In an ideal dataset, every unique combination of country and year (country_year) would be represented by 12 entries (2 genders across 6 age groups). Now, we need to verify the completeness of our data for each country_year combination.

data %>%
  group_by(country_year) %>%
  count() %>% #this SHOULD give 12 rows for every county-year combination (6 age bands * 2 gender)
  filter(n != 12)

It appears that there is problem with 2016 data.

year_value_counts <- as.data.frame(sort(table(data$year), decreasing = FALSE))
names(year_value_counts) <- c("Year", "Count")
head(year_value_counts,5)

Our exploration reveals that the dataset for the year 2016 is not only sparse, but also incomplete for the few countries that have entries. Additionally, data for the years between 1985 and 1989 are also quite limited. These issues need to be addressed. As a solution, we’ve decided to exclude the data from 2016. We also drop the ‘country_year’ column from our dataset.

data <- data %>%
  filter(year != 2016) %>% # I therefore exclude 2016 data
  select(-country_year)

In the following step, we focus on filtering our dataset to ensure its robustness for further analysis. Specifically, we are addressing the issue of certain countries that have insufficient data spread across the years. These sparse data points can potentially skew our analysis or generate inaccurate insights. Therefore, we will systematically remove such countries from our dataset to maintain data integrity and reliability for subsequent steps in our study.

minimum_years <- data %>%
  group_by(country) %>%
  summarize(rows = n(), 
            years = rows / 12) %>%
  arrange(years)

minimum_years <- minimum_years %>%
  filter(minimum_years$years<=3)
  

data <- data %>%
filter(!(country %in% minimum_years$country))

dim(data)
[1] 27492    16
sapply(data, function(x) length(unique(x)))
       country           year            sex            age    suicides_no     population 
            93             31              2              6           2083          25292 
 suicide_ratio   HDI_for_year   GDP_for_year GDP_per_capita     generation      continent 
          5282            306           2291           2205              6              5 
      life_exp       avg_temp       max_temp       min_temp 
          2166           2135           2000           2073 

We’ve further refined our dataset, eliminating data from 2016 and from eight countries due to their sparse or incomplete data. This ensures a more reliable basis for our analysis

2.2.2 NA Values

Once we’ve eliminated the incomplete data, we’ll proceed to inspect each feature for the presence of null values.

na_counts <- sapply(data, function(x) sum(is.na(x))/nrow(data)*100)
na_counts_df <- data.frame(Feature = names(na_counts), NA_ratio = na_counts)
na_counts_df = na_counts_df %>% `rownames<-`( NULL )
print(na_counts_df)

Approximately 70% of the ‘HDI_for_year’ column contains null values, necessitating an adjustment. Despite comprehensive exploration, we were unable to find any reliable data to fill these null values in the HDI. Additionally, since the formula for calculating the HDI changed in 2010, the index before and after this year is not directly comparable. As we’ve added life expectancy data for each year, the decision has been made to drop the ‘HDI_for_year’ column.

data = subset(data, select = -c(HDI_for_year) )

Approximately 7.5% of the temperature data consists of null values which requires addressing. As a first step, we’ll identify the countries that contain null values in their temperature data.

data %>% 
  group_by(country) %>% 
  filter(all(is.na(min_temp))) %>% 
  pull(country) %>% 
  unique()
[1] "Maldives"

There’s only one country, the Maldives (located in South Asia), for which temperature data is unavailable. Given that we have ample data for Asia, we’ve made the decision to exclude the Maldives from our dataset.

data %>%
  group_by(continent) %>%
  summarise(num_countries = n_distinct(country))
countries_to_remove <- c("Maldives")

data <- data[!data$country %in% countries_to_remove, ]

Given that our temperature data concludes in 2013, we will fill the ‘avg_temp’, ‘min_temp’, and ‘max_temp’ fields for the years 2014 and 2015 using the corresponding data from 2013.

We observed that the data for Ukraine in 2013 is missing from our suicide dataset. Consequently, to address this absence, we will use the data from 2012 for this particular country to approximate the values for 2014 and 2015.

df_2013 <- data %>%
  filter(year == 2013) %>%
  select(country, avg_temp, min_temp, max_temp)

names(df_2013)[2:4] <- paste0(names(df_2013)[2:4], "_2013")

df_2012 <- data %>%
  filter(year == 2012) %>%
  select(country, avg_temp, min_temp, max_temp)

names(df_2012)[2:4] <- paste0(names(df_2012)[2:4], "_2012")

# Replace NA values in 2014 and 2015 using the lookup table
data <- data %>%
  mutate(year = as.character(year)) %>%
  rowwise() %>%
  mutate(
    avg_temp = ifelse(year %in% c("2014", "2015") & is.na(avg_temp) & country != "Ukraine",
                      df_2013$avg_temp_2013[df_2013$country == country],
                      ifelse(year %in% c("2014", "2015") & is.na(avg_temp) & country == "Ukraine",
                             df_2012$avg_temp_2012[df_2012$country == country],
                             avg_temp)),
    min_temp = ifelse(year %in% c("2014", "2015") & is.na(min_temp) & country != "Ukraine",
                      df_2013$min_temp_2013[df_2013$country == country],
                      ifelse(year %in% c("2014", "2015") & is.na(min_temp) & country == "Ukraine",
                             df_2012$min_temp_2012[df_2012$country == country],
                             min_temp)),
    max_temp = ifelse(year %in% c("2014", "2015") & is.na(max_temp) & country != "Ukraine",
                      df_2013$max_temp_2013[df_2013$country == country],
                      ifelse(year %in% c("2014", "2015") & is.na(max_temp) & country == "Ukraine",
                             df_2012$max_temp_2012[df_2012$country == country],
                             max_temp))
  ) %>%
  ungroup()
data <- data %>%
  mutate(year = as.integer(year))
dim(data)
[1] 27372    15
na_counts <- sapply(data, function(x) sum(is.na(x))/nrow(data)*100)
na_counts_df <- data.frame(Feature = names(na_counts), NA_ratio = na_counts)
na_counts_df = na_counts_df %>% `rownames<-`( NULL )
print(filter(na_counts_df, NA_ratio>0))

With all missing values effectively handled, our dataset is now clean and ready for further analysis.

2.3 Factorizing Categorical Data

# Nominal factors
data_nominal <- c('country', 'sex', 'continent')

data[data_nominal] <- lapply(data[data_nominal], function(x){factor(x)})


# Making age ordinal
data$age <- factor(data$age, 
                   ordered = T, 
                   levels = c("5-14",
                              "15-24", 
                              "25-34", 
                              "35-54", 
                              "55-74", 
                              "75+"))

# Making generation ordinal
data$generation <- factor(data$generation, 
                   ordered = T, 
                   levels = c("G.I. Generation", 
                              "Silent",
                              "Boomers", 
                              "Generation X", 
                              "Millenials", 
                              "Generation Z"))

data <- as_tibble(data)

2.4 Outliers

Detecting and addressing outliers is a fundamental step in data preprocessing, especially for linear regression models that are significantly influenced by outliers.

Several methods exist to identify outliers, including:

  • Visual Inspection using Boxplots and Scatterplots: These plots offer a straightforward way to visually identify outliers. Boxplots are particularly useful for univariate analysis, while scatterplots facilitate bivariate analysis.

  • Z-Score Method: This technique labels any data point that deviates more than three standard deviations from the mean as an outlier. However, this method is only effective when the data is completely or nearly normally distributed. Hence, it’s not ideal for skewed data.

  • Tukey’s Fences:It is calculated by creating a “fence” boundary a distance of 1.5 IQR beyond the 1st and 3rd quartiles. Any data beyond these fences are considered to be outliers.This method provides a robust mechanism to spot outliers, even in skewed distributions.

Once outliers are identified, we can employ several strategies to handle them, such as:

  • Rescaling and Transforming Data: Techniques such as log transformation, square root transformation, or cube root transformation can help lessen the data skewness and mitigate the effects of outliers.

  • Truncation or Winsorization: This method caps the outliers at a specified percentile of the data, like the 5th or 95th percentile.

  • Removing Outliers: In extreme cases, when we are confident that an outlier arises from incorrect data entry or measurement, we might decide to eliminate these values to avoid their undue influence on our model. However, this method should be a last resort, as it might result in information loss and should be justified thoroughly.

Let’s begin by examining the distribution and range of our features to better understand the spread and dispersion of our data.

summary(data)
      country           year          sex           age        suicides_no   
 Argentina:  372   Min.   :1985   Female:13686   5-14 :4562   Min.   :    0  
 Austria  :  372   1st Qu.:1994   Male  :13686   15-24:4562   1st Qu.:    3  
 Belgium  :  372   Median :2002                  25-34:4562   Median :   26  
 Brazil   :  372   Mean   :2001                  35-54:4562   Mean   :  246  
 Chile    :  372   3rd Qu.:2008                  55-74:4562   3rd Qu.:  134  
 Colombia :  372   Max.   :2015                  75+  :4562   Max.   :22338  
 (Other)  :25140                                                             
   population       suicide_ratio       GDP_for_year       GDP_per_capita  
 Min.   :     278   Min.   :  0.0000   Min.   :4.692e+07   Min.   :   251  
 1st Qu.:  101399   1st Qu.:  0.9775   1st Qu.:9.210e+09   1st Qu.:  3424  
 Median :  441232   Median :  6.0900   Median :4.921e+10   Median :  9378  
 Mean   : 1869421   Mean   : 12.9199   Mean   :4.517e+11   Mean   : 16850  
 3rd Qu.: 1512871   3rd Qu.: 16.7625   3rd Qu.:2.627e+11   3rd Qu.: 24922  
 Max.   :43805214   Max.   :224.9700   Max.   :1.812e+13   Max.   :126352  
                                                                           
           generation      continent        life_exp        avg_temp          max_temp     
 G.I. Generation:2724   Africa  :  828   Min.   :53.98   Min.   :-18.085   Min.   :-2.065  
 Silent         :6262   Americas: 9156   1st Qu.:70.89   1st Qu.:  9.469   1st Qu.:19.398  
 Boomers        :4908   Asia    : 5148   Median :74.57   Median : 14.165   Median :24.004  
 Generation X   :6314   Europe  :11268   Mean   :74.10   Mean   : 15.888   Mean   :23.383  
 Millenials     :5714   Oceania :  972   3rd Qu.:77.74   3rd Qu.: 25.743   3rd Qu.:27.966  
 Generation Z   :1450                    Max.   :83.79   Max.   : 29.671   Max.   :38.842  
                                                                                           
    min_temp      
 Min.   :-33.783  
 1st Qu.: -2.155  
 Median :  5.740  
 Mean   :  7.952  
 3rd Qu.: 22.701  
 Max.   : 27.492  
                  

2.4.1 Visualize Distribution of the Data


# Set the overall layout for the combined plot
par(mfrow = c(3, 4))
par(mar = c(2, 2, 2, 2))  # Adjust the margins for each plot

# for each column in the dataframe

for(col in names(data)) {
  # if it's a numeric column
  if(is.numeric(data[[col]])) {
    # create a histogram
    hist(data[[col]], main=col, xlab=col, col = "#13527a", border = "#ebebeb", cex.main = 1)
  }
}

2.4.2 BoxPlots

# Set the overall layout for the combined plot
par(mfrow = c(2, 4))
par(mar = c(2, 2, 2, 2))  # Adjust the margins for each plot

# for each column in the dataframe

for(col in names(data)) {
  # if it's a numeric column
  if(is.numeric(data[[col]])) {
    # create a histogram
    boxplot(data[[col]], main=col, xlab=col, col = "#ebebeb", border = "#13527a", cex.main = 1)
  }
}

NA

From the above boxplots, it’s evident that “suicides_no”, “population”, and “GDP_for_year” all exhibit a significant number of outliers. Additionally, “suicide_ratio” and “GDP_per_capita” also show a substantial number of outlier values.

Our examination of the boxplots and histograms reveals a significant concentration of data within the ‘suicide_no’ and ‘suicide_ratio’ parameters, skewed towards zero. This high density around zero manifests as a long tail in the distribution towards the right. In order to address this skewness, we need to delve deeper into the records where ‘suicide_no’ is recorded as zero.

# Define a common theme
common_theme <- theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.title = element_text(size = 12),
    legend.position = "none",
    panel.grid.major = element_line(color = "grey", linewidth = 0.1),
    panel.grid.minor = element_blank()
  )

# Define a common color
common_color <- "steelblue"

# Data for the first plot
zero_suicides_data <- data %>%
  filter(suicides_no == 0) %>%
  group_by(age) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = age, y = count, fill=age)) +
    geom_bar(stat = "identity") +
    labs(x = "Age", y = "Count", 
         title = "Zero Suicides by Age Group") +
    common_theme

# Data for the second plot
age_plot <- data %>%
  group_by(age) %>%
  summarize(suicide_per_100k = (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000) %>%
  ggplot(aes(x = age, y = suicide_per_100k, fill = age)) + 
  geom_bar(stat = "identity") + 
  labs(title = "Global suicides per 100k, by Age",
       x = "Age", 
       y = "Suicides per 100k") +
  common_theme +
  scale_y_continuous(breaks = seq(0, 30, 1), minor_breaks = NULL) # Changed breaks for better visibility

# Arrange the plots
grid.arrange(age_plot, zero_suicides_data, ncol = 2)

Our observations suggest that a significant proportion of zero-suicide instances are within the 5-14 age bracket. The overall suicide average for this age group is notably low, generally under one.

This age group, however, doesn’t adequately represent the larger population. The reasons for suicide within this age bracket are likely to be fundamentally different from those of other groups, potentially influenced by unique causes.

Given these considerations, we have opted to exclude the 5-14 age group from our data. This decision stems from the realization that this group exhibits distinctly different behaviors and is not representative of the broader population in the context of suicide rates.

data <- data%>%
  filter(age != '5-14')
dim(data)
[1] 22810    15

2.4.3 Tukey’s Fences

As indicated by the histograms above, most of our data does not adhere to a normal distribution. To yield more specific results, we will initially employ Tukey’s Fences method to identify the number of outliers in each feature.

#check the number of outliers in each features

# Define the outlier_count function
Tukey_outlier_count <- function(col) {
  q75 <- quantile(col, 0.75, na.rm= TRUE)
  q25 <- quantile(col, 0.25, na.rm= TRUE)
  iqr <- q75 - q25
  min_val <- q25 - (iqr * 1.5)
  max_val <- q75 + (iqr * 1.5)
  outlier_count <- sum(col > max_val | col < min_val)
  outlier_percent <- round(outlier_count / length(col) * 100, 2)
  return(c(outlier_count, outlier_percent))
}

# Get numeric data
numeric_data <- data[, sapply(data, is.numeric)]

# Apply the function to numeric columns
outliers <- sapply(numeric_data, Tukey_outlier_count)

# Convert to dataframe
outliers_df <- as.data.frame(t(outliers))
colnames(outliers_df) <- c("Outlier_Count", "Outlier_Percent")

# Print the result
print(outliers_df)

2.4.4 Transformations

The results from the Tukey’s method, consistent with our boxplot observations, reveal a significant proportion of outliers in the “suicides_no”, “suicide_ratio”, “population”, and “GDP_for_year” data. In an attempt to reduce the impact of these outliers, we plan to enrich our dataset with additional columns, each representing log-transformed and square root-transformed values of these variables.

However, we face a challenge with the “suicide_no” and “suicide_ratio” variables as they contain zero values, making it impossible to apply a straightforward log transformation. To circumvent this issue, we’ll introduce an adjustment factor, a constant c=1, to all suicide numbers. Subsequently, we’ll compute a new ratio and apply a log transformation to it. This approach ensures a smooth and successful transformation process.

# Add a small constant to avoid undefined log values
c <- 1

data <- data %>%
  mutate(new_suicides_no = suicides_no + c,
         new_suicide_ratio = new_suicides_no / population,
         log_population = log(population),
         log_GDP_year = log(GDP_for_year),
         log_GDP_capita = log(GDP_per_capita),
         log_suicide_no = log(new_suicides_no),
         log_suicide_ratio = log(new_suicide_ratio)
         )

In our analysis, we opted for the natural logarithm for its ease of interpretation. While logarithmic transformations with different bases don’t alter the distribution’s form, they do have implications for how we interpret the coefficients in our model.

With the natural logarithm (base e), coefficients in a model where both the predictor (x) and response (y) variables are log-transformed indicate the percentage change in y corresponding to a 1% change in x.

On the other hand, if a base-10 logarithm were used in the same circumstances, each coefficient would represent the change in y associated with a 10% change in x.

Therefore, by using the natural logarithm, we simplify the interpretation of our model’s output, enabling more straightforward conclusions and discussions.

data <- data %>%
  mutate(sqrt_population = sqrt(population),
         sqrt_GDP_year = sqrt(GDP_for_year),
         sqrt_GDP_capita = sqrt(GDP_per_capita),
         sqrt_suicide_no = sqrt(suicides_no),
         sqrt_suicide_ratio = sqrt(suicide_ratio)
         )
# Define the list of columns to be processed
transformed_col = c("population","log_population", "sqrt_population",
                    "GDP_for_year", "log_GDP_year", "sqrt_GDP_year", 
                    "GDP_per_capita", "log_GDP_capita", "sqrt_GDP_capita" , 
                    "suicides_no", "log_suicide_no", "sqrt_suicide_no",
                    "suicide_ratio","log_suicide_ratio", "sqrt_suicide_ratio")

# Apply the function to numeric columns
transformed_data <- data[, transformed_col]
transformed_outliers <- sapply(transformed_data, Tukey_outlier_count)

# Convert to dataframe
outliers_df <- as.data.frame(t(transformed_outliers))
colnames(outliers_df) <- c("Outlier_Count", "Outlier_Percent")

# Print the result
print(outliers_df)

# Set the overall layout for the combined plot
par(mfrow = c(3, 4))
par(mar = c(2, 2, 2, 2))  # Adjust the margins for each plot

# for each column in the dataframe

for(col in names(data)) {
  # if it's a numeric column
  if(is.numeric(data[[col]])) {
    # create a histogram
    hist(data[[col]], main=col, xlab=col, col = "#13527a", border = "#ebebeb", cex.main = 1)
  }
}

Upon applying the log transformation, we noticed a remarkable decrease in the number of outliers. Moreover, the transformed data showed a tendency towards a more normal distribution, indicating the effectiveness of the transformation.

With the data distribution now less skewed and more akin to a normal distribution, we leveraged the Z-score method to further quantify the remaining outliers in each column. This allowed us a more precise examination of the data spread and outlier prevalence.

#data = subset(data, select = -c(sqrt_population,
#                                sqrt_GDP_year,
#                                sqrt_GDP_capita,
#                                sqrt_suicide_no,
#                                sqrt_suicide_ratio))

2.4.5 Z_Score

Next, we employ the Z-Score method to identify potential outliers within each feature. To do this, we’ll establish upper and lower bounds, beyond which a data point will be classified as an outlier. The calculations for these boundaries are as follows:

Upper limit: Mean + (3 * Standard Deviation) Lower limit: Mean - (3 * Standard Deviation)

This method is based on the principle that for a normally distributed dataset, about 99.7% of data falls within three standard deviations from the mean. Hence, any data point beyond this range can be considered an outlier.

Z_Score <- function(col){
  return((col - mean(col)) / sd(col))
}

Z_outlier_count <- function(col) {
  Upper_limit = mean(col) + (3 * sd(col))
  Lower_limit = mean(col) - (3 * sd(col))
  outlier_count <- sum(col > Upper_limit| col < Lower_limit)
  outlier_percent <- round(outlier_count / length(col) * 100, 2)
  return(c(outlier_count, outlier_percent))
}

# Get numeric data
numeric_data <- data[, sapply(data, is.numeric)]

# Apply the function to numeric columns
outliers <- sapply(numeric_data, Z_outlier_count)

# Convert to dataframe
outliers_df <- as.data.frame(t(outliers))
colnames(outliers_df) <- c("Outlier_Count", "Outlier_Percent")

# Print the result
print(outliers_df)

2.4.6 Exploring Outliers

In our project, “suicide_ratio” is the key variable we aim to predict using linear regression. It is important to acknowledge that linear regression models are particularly susceptible to the influence of outliers. Therefore, it is essential to adequately address and manage any outliers present in the “suicide_ratio” variable, to ensure our model’s accuracy and reliability.

let’s explore outliers in suicide ratio and check if there is any pattern in them. we use tukey’s fence dtected outliers.(to be edited)

# Calculate IQR and fences
Q1 <- quantile(data$suicide_ratio, 0.25)
Q3 <- quantile(data$suicide_ratio, 0.75)
IQR <- Q3 - Q1

lower_fence <- Q1 - 1.5 * IQR
upper_fence <- Q3 + 1.5 * IQR

# Filter outliers
suicide_outliers <- data %>% 
  filter(suicide_ratio < lower_fence | suicide_ratio > upper_fence)
summary(data)
      country           year          sex           age        suicides_no     
 Argentina:  310   Min.   :1985   Female:11405   5-14 :   0   Min.   :    0.0  
 Austria  :  310   1st Qu.:1994   Male  :11405   15-24:4562   1st Qu.:    6.0  
 Belgium  :  310   Median :2002                  25-34:4562   Median :   42.0  
 Brazil   :  310   Mean   :2001                  35-54:4562   Mean   :  292.9  
 Chile    :  310   3rd Qu.:2008                  55-74:4562   3rd Qu.:  180.0  
 Colombia :  310   Max.   :2015                  75+  :4562   Max.   :22338.0  
 (Other)  :20950                                                               
   population       suicide_ratio     GDP_for_year       GDP_per_capita  
 Min.   :     278   Min.   :  0.00   Min.   :4.692e+07   Min.   :   251  
 1st Qu.:   97662   1st Qu.:  2.99   1st Qu.:9.210e+09   1st Qu.:  3424  
 Median :  431350   Median :  8.58   Median :4.921e+10   Median :  9378  
 Mean   : 1875252   Mean   : 15.38   Mean   :4.517e+11   Mean   : 16850  
 3rd Qu.: 1479941   3rd Qu.: 20.00   3rd Qu.:2.627e+11   3rd Qu.: 24922  
 Max.   :43805214   Max.   :224.97   Max.   :1.812e+13   Max.   :126352  
                                                                         
           generation      continent       life_exp        avg_temp          max_temp     
 G.I. Generation:2724   Africa  : 690   Min.   :53.98   Min.   :-18.085   Min.   :-2.065  
 Silent         :6262   Americas:7630   1st Qu.:70.89   1st Qu.:  9.469   1st Qu.:19.398  
 Boomers        :4908   Asia    :4290   Median :74.57   Median : 14.165   Median :24.004  
 Generation X   :5688   Europe  :9390   Mean   :74.10   Mean   : 15.888   Mean   :23.383  
 Millenials     :3228   Oceania : 810   3rd Qu.:77.74   3rd Qu.: 25.743   3rd Qu.:27.966  
 Generation Z   :   0                   Max.   :83.79   Max.   : 29.671   Max.   :38.842  
                                                                                          
    min_temp       new_suicides_no   new_suicide_ratio   log_population    log_GDP_year  
 Min.   :-33.783   Min.   :    1.0   Min.   :5.830e-07   Min.   : 5.628   Min.   :17.66  
 1st Qu.: -2.155   1st Qu.:    7.0   1st Qu.:4.955e-05   1st Qu.:11.489   1st Qu.:22.94  
 Median :  5.740   Median :   43.0   Median :1.101e-04   Median :12.975   Median :24.62  
 Mean   :  7.952   Mean   :  293.9   Mean   :1.852e-04   Mean   :12.790   Mean   :24.56  
 3rd Qu.: 22.701   3rd Qu.:  181.0   3rd Qu.:2.296e-04   3rd Qu.:14.208   3rd Qu.:26.29  
 Max.   : 27.492   Max.   :22339.0   Max.   :3.597e-03   Max.   :17.595   Max.   :30.53  
                                                                                         
 log_GDP_capita   log_suicide_no   log_suicide_ratio sqrt_population   sqrt_GDP_year    
 Min.   : 5.525   Min.   : 0.000   Min.   :-14.354   Min.   :  16.67   Min.   :   6850  
 1st Qu.: 8.139   1st Qu.: 1.946   1st Qu.: -9.912   1st Qu.: 312.51   1st Qu.:  95966  
 Median : 9.146   Median : 3.761   Median : -9.114   Median : 656.77   Median : 221832  
 Mean   : 9.063   Mean   : 3.613   Mean   : -9.177   Mean   : 971.62   Mean   : 411724  
 3rd Qu.:10.124   3rd Qu.: 5.198   3rd Qu.: -8.379   3rd Qu.:1216.53   3rd Qu.: 512552  
 Max.   :11.747   Max.   :10.014   Max.   : -5.628   Max.   :6618.55   Max.   :4256843  
                                                                                        
 sqrt_GDP_capita  sqrt_suicide_no   sqrt_suicide_ratio
 Min.   : 15.84   Min.   :  0.000   Min.   : 0.000    
 1st Qu.: 58.52   1st Qu.:  2.449   1st Qu.: 1.729    
 Median : 96.84   Median :  6.481   Median : 2.929    
 Mean   :111.94   Mean   : 10.602   Mean   : 3.227    
 3rd Qu.:157.87   3rd Qu.: 13.416   3rd Qu.: 4.472    
 Max.   :355.46   Max.   :149.459   Max.   :14.999    
                                                      
summary(suicide_outliers)
               country          year          sex          age       suicides_no     
 Russian Federation: 109   Min.   :1985   Female:  53   5-14 :  0   Min.   :    1.0  
 Kazakhstan        :  99   1st Qu.:1995   Male  :1570   15-24: 44   1st Qu.:   72.0  
 Lithuania         :  94   Median :2001                 25-34:171   Median :  204.0  
 Ukraine           :  88   Mean   :2001                 35-54:255   Mean   : 1107.4  
 Hungary           :  86   3rd Qu.:2007                 55-74:336   3rd Qu.:  859.5  
 Belarus           :  78   Max.   :2015                 75+  :817   Max.   :22338.0  
 (Other)           :1069                                                             
   population       suicide_ratio     GDP_for_year       GDP_per_capita  
 Min.   :     889   Min.   : 45.52   Min.   :4.692e+07   Min.   :   425  
 1st Qu.:  103700   1st Qu.: 54.26   1st Qu.:1.654e+10   1st Qu.:  2891  
 Median :  273776   Median : 66.90   Median :4.552e+10   Median :  7458  
 Mean   : 1636117   Mean   : 72.65   Mean   :3.840e+11   Mean   : 13201  
 3rd Qu.: 1274659   3rd Qu.: 84.08   3rd Qu.:1.959e+11   3rd Qu.: 19012  
 Max.   :21476420   Max.   :224.97   Max.   :8.100e+12   Max.   :113120  
                                                                         
           generation     continent       life_exp        avg_temp          max_temp     
 G.I. Generation:447   Africa  :   8   Min.   :61.42   Min.   :-18.085   Min.   :-2.065  
 Silent         :690   Americas: 229   1st Qu.:69.01   1st Qu.:  6.761   1st Qu.:18.369  
 Boomers        :265   Asia    : 298   Median :72.43   Median :  9.377   Median :21.479  
 Generation X   :187   Europe  :1080   Mean   :72.66   Mean   : 10.965   Mean   :21.537  
 Millenials     : 34   Oceania :   8   3rd Qu.:76.20   3rd Qu.: 13.192   3rd Qu.:24.358  
 Generation Z   :  0                   Max.   :83.20   Max.   : 28.849   Max.   :36.894  
                                                                                         
    min_temp        new_suicides_no   new_suicide_ratio   log_population   log_GDP_year  
 Min.   :-33.7830   Min.   :    2.0   Min.   :0.0004552   Min.   : 6.79   Min.   :17.66  
 1st Qu.: -7.6490   1st Qu.:   73.0   1st Qu.:0.0005556   1st Qu.:11.55   1st Qu.:23.53  
 Median : -2.5030   Median :  205.0   Median :0.0006902   Median :12.52   Median :24.54  
 Mean   : -0.3596   Mean   : 1108.4   Mean   :0.0007523   Mean   :12.68   Mean   :24.72  
 3rd Qu.:  2.5870   3rd Qu.:  860.5   3rd Qu.:0.0008599   3rd Qu.:14.06   3rd Qu.:26.00  
 Max.   : 27.2080   Max.   :22339.0   Max.   :0.0033746   Max.   :16.88   Max.   :29.72  
                                                                                         
 log_GDP_capita   log_suicide_no    log_suicide_ratio sqrt_population   sqrt_GDP_year    
 Min.   : 6.052   Min.   : 0.6931   Min.   :-7.695    Min.   :  29.82   Min.   :   6850  
 1st Qu.: 7.969   1st Qu.: 4.2905   1st Qu.:-7.495    1st Qu.: 322.02   1st Qu.: 128594  
 Median : 8.917   Median : 5.3230   Median :-7.278    Median : 523.24   Median : 213353  
 Mean   : 8.866   Mean   : 5.4378   Mean   :-7.244    Mean   : 888.36   Mean   : 395976  
 3rd Qu.: 9.853   3rd Qu.: 6.7575   3rd Qu.:-7.059    3rd Qu.:1129.01   3rd Qu.: 442612  
 Max.   :11.636   Max.   :10.0141   Max.   :-5.691    Max.   :4634.27   Max.   :2846085  
                                                                                         
 sqrt_GDP_capita  sqrt_suicide_no   sqrt_suicide_ratio
 Min.   : 20.62   Min.   :  1.000   Min.   : 6.747    
 1st Qu.: 53.77   1st Qu.:  8.485   1st Qu.: 7.366    
 Median : 86.36   Median : 14.283   Median : 8.179    
 Mean   : 99.74   Mean   : 23.140   Mean   : 8.423    
 3rd Qu.:137.88   3rd Qu.: 29.317   3rd Qu.: 9.169    
 Max.   :336.33   Max.   :149.459   Max.   :14.999    
                                                      

The analysis shows that the top six countries with outlier suicide ratios are the Russian Federation, Kazakhstan, Ukraine, Lithuania, Hungary, and Belarus. Intriguingly, these nations are not only geographically proximate, but also share cultural and historical links. This observation may imply potential regional trends or shared socio-economic factors influencing the elevated suicide ratios.

Furthermore, a striking detail emerges from the outliers: nearly all, or 96.7%, are men. This finding indicates a significantly higher incidence of extreme suicide ratios among men.

dim(suicide_outliers)
[1] 1623   27

Due to the nature of our data, and the analysis we performed we believe that removing outliers or applying trunication(Winsorization) will cause information loss. so we keep the outliers for EDA.and we will try different methods on modeling the data to see which perform the best on our data. these are the methods we will try: 1. removing detected outliers with both Tukey’s Fence and Z_score method. With removing outliers However, this can be risky because it assumes that the outliers are not informative and may lead to biased estimates.

  1. Robust Regression Methods: Given the number of outliers and their potential influence on the model, a robust regression method might be a good choice. These methods are less sensitive to outliers and can often provide better predictive performance when outliers are present.

  2. Other Machine learning models which are less sensetive to outliers. (to be edited )

3. Explore Data

In this forthcoming section, we dive into the exploration of our dataset, distinguishing variables into four distinct categories.

Firstly, we have ‘year’ which falls under time-dependent variables, mapping the temporal evolution of our data.

Secondly, we have a set of geographical and meteorological variables. These include ‘continent’, ‘country’, ‘population’, and a range of temperature parameters (minimum, maximum, average) alongside their transformations, offering us insights into regional and environmental influences.

Our third category brings together social and economic variables such as ‘life expectancy’, ‘GDP’, and ‘GDP per capita’. These, along with their respective transformations, capture the socio-economic backdrop against which we observe our data.

Lastly, our fourth category comprises demographic variables, namely ‘sex’ and ‘age’, allowing us to examine the influence of these vital demographics on our data.

Moreover, we have identified three potential target variables for our study: ‘suicide_ratio’, ‘log_suicide_ratio’, and ‘sqrt_suicide_ratio’. Of these, ‘log_suicide_ratio’ has been found to be highly effective in minimizing the impact of outliers. Yet, our exploration won’t be limited to it. We aim to thoroughly investigate the impact of all variables on each potential target until we embark on the modeling phase, where we will select the most suitable target variable for our predictive model.

glimpse(data)
Rows: 22,810
Columns: 27
$ country            <fct> Albania, Albania, Albania, Albania, Albania, Albania, Albania, …
$ year               <int> 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1987, 198…
$ sex                <fct> Male, Male, Female, Male, Male, Female, Female, Female, Male, F…
$ age                <ord> 15-24, 35-54, 15-24, 75+, 25-34, 75+, 35-54, 25-34, 55-74, 55-7…
$ suicides_no        <dbl> 21, 16, 14, 1, 9, 1, 6, 4, 1, 0, 2, 17, 1, 14, 4, 8, 3, 5, 5, 4…
$ population         <dbl> 312900, 308000, 289700, 21800, 274300, 35600, 278800, 257200, 1…
$ suicide_ratio      <dbl> 6.71, 5.19, 4.83, 4.59, 3.28, 2.81, 2.15, 1.56, 0.73, 0.00, 5.4…
$ GDP_for_year       <dbl> 2156624900, 2156624900, 2156624900, 2156624900, 2156624900, 215…
$ GDP_per_capita     <dbl> 796, 796, 796, 796, 796, 796, 796, 796, 796, 796, 769, 769, 769…
$ generation         <ord> Generation X, Silent, Generation X, G.I. Generation, Boomers, G…
$ continent          <fct> Europe, Europe, Europe, Europe, Europe, Europe, Europe, Europe,…
$ life_exp           <dbl> 72.352, 72.352, 72.352, 72.352, 72.352, 72.352, 72.352, 72.352,…
$ avg_temp           <dbl> 12.99658, 12.99658, 12.99658, 12.99658, 12.99658, 12.99658, 12.…
$ max_temp           <dbl> 24.084, 24.084, 24.084, 24.084, 24.084, 24.084, 24.084, 24.084,…
$ min_temp           <dbl> 3.304, 3.304, 3.304, 3.304, 3.304, 3.304, 3.304, 3.304, 3.304, …
$ new_suicides_no    <dbl> 22, 17, 15, 2, 10, 2, 7, 5, 2, 1, 3, 18, 2, 15, 5, 9, 4, 6, 6, …
$ new_suicide_ratio  <dbl> 7.031000e-05, 5.519481e-05, 5.177770e-05, 9.174312e-05, 3.64564…
$ log_population     <dbl> 12.653639, 12.637855, 12.576601, 9.989665, 12.521978, 10.480101…
$ log_GDP_year       <dbl> 21.49181, 21.49181, 21.49181, 21.49181, 21.49181, 21.49181, 21.…
$ log_GDP_capita     <dbl> 6.679599, 6.679599, 6.679599, 6.679599, 6.679599, 6.679599, 6.6…
$ log_suicide_no     <dbl> 3.0910425, 2.8332133, 2.7080502, 0.6931472, 2.3025851, 0.693147…
$ log_suicide_ratio  <dbl> -9.562596, -9.804642, -9.868551, -9.296518, -10.219393, -9.7869…
$ sqrt_population    <dbl> 559.3747, 554.9775, 538.2379, 147.6482, 523.7366, 188.6796, 528…
$ sqrt_GDP_year      <dbl> 46439.48, 46439.48, 46439.48, 46439.48, 46439.48, 46439.48, 464…
$ sqrt_GDP_capita    <dbl> 28.21347, 28.21347, 28.21347, 28.21347, 28.21347, 28.21347, 28.…
$ sqrt_suicide_no    <dbl> 4.582576, 4.000000, 3.741657, 1.000000, 3.000000, 1.000000, 2.4…
$ sqrt_suicide_ratio <dbl> 2.5903668, 2.2781571, 2.1977261, 2.1424285, 1.8110770, 1.676305…
column_name <- colnames(data)

3.1 Time-Dependent

# the global rate over the time period will be useful:
global_average <- (sum(as.numeric(data$suicides_no)) / sum(as.numeric(data$population))) * 100000

data %>%
  group_by(year) %>%
  summarize(population = sum(population), 
            suicides = sum(suicides_no), 
            suicides_per_100k = (suicides / population) * 100000) %>%
  ggplot(aes(x = year, y = suicides_per_100k)) + 
  geom_line(col = "red", linewidth = 1) + 
  geom_point(col = "red", size = 2) + 
  geom_hline(yintercept = global_average, linetype = 2, color = "grey35", linewidth = 1) +
  labs(title = "Global Suicides (per 100k)",
       subtitle = "Trend over time, 1985 - 2015.",
       x = "Year", 
       y = "Suicides per 100k") + 
  scale_x_continuous(breaks = seq(1985, 2015, 2)) + 
  scale_y_continuous(breaks = seq(10, 20))

The plot above yields several insightful observations:

  • The highest suicide rate recorded was 18.7 deaths per 100k population, observed in 1995.
  • This rate has seen a consistent decrease, falling to 13.5 per 100k population by 2015, which translates to a significant reduction of about 27%.
  • Presently, the rates are gradually regressing towards the figures prevalent prior to the 1990s.

However, a crucial aspect to remember is that the data available from the 1980s is relatively scarce, thus making it difficult to conclusively state whether these rates were an accurate reflection of the global suicide trends during that period.

3.1.1 Why did people killed themselves in 1995?

data_95 <- data %>%
  filter(year == 1995) %>%
  group_by(country) %>%
  summarize(population = sum(population), 
            suicides = sum(suicides_no), 
            suicides_per_100k = (suicides / population) * 100000)

data_95 <- data_95 %>%
  arrange(desc(suicides_per_100k))

head(data_95)

During our exploration of outliers, we observed that Eastern European countries have significantly higher suicide rates compared to other nations. In particular, it appears that a substantial number of suicides in 1995 were reported in Russia.

data_without_ru <- data %>%
  filter(country != "Russian Federation") %>%
  group_by(year) %>%
  summarize(population = sum(population), 
            suicides = sum(suicides_no), 
            suicides_per_100k = (suicides / population) * 100000)


yearly_data <- data %>%
  group_by(year) %>%
  summarize(population = sum(population), 
            suicides = sum(suicides_no), 
            suicides_per_100k = (suicides / population) * 100000)

data_without_ru <- data_without_ru %>%
  mutate(inclusion = "Without Russia")

yearly_data <- yearly_data %>%
  mutate(inclusion = "With Russia")

combined_data <- bind_rows(data_without_ru, yearly_data)

ggplot(combined_data, aes(x = year, y = suicides_per_100k, color = inclusion)) +
  geom_line() +
  labs(
    title = 'Number of Suicides for every 100k people', 
    x = 'Year', 
    y = 'Suicide Ratio per 100k',
    color = "Country Inclusion"
  ) +
  theme(legend.position = "right") +
  scale_color_manual(values = c("Without Russia" = "blue", "With Russia" = "red"))

The significant increase in suicide rates observed in Russia in 1995 can be attributed to a combination of several factors, each contributing to an overall sense of despair and instability within the population:

Economic Crisis: The period marked Russia’s challenging transition from a centrally planned economy to a free-market system. This shift resulted in substantial economic turmoil characterized by high unemployment rates, rampant inflation, and general economic uncertainty. As numerous studies have indicated, such economic hardships can greatly increase stress levels within the population, thereby leading to higher suicide rates.

Political Instability: The dissolution of the Soviet Union in 1991 precipitated a series of radical political and societal changes. The ensuing uncertainty and the resultant feeling of insecurity may have exacerbated the already volatile situation, thereby contributing to the rise in suicide rates.

Rise in Alcoholism: During this period, Russia experienced an increase in alcohol abuse, a problem that has historically been a challenge for the country. A well-established body of research shows a strong correlation between alcohol abuse and suicide rates. The spike in alcoholism during this time might have, therefore, been a significant contributor to the suicide rates links.

It is essential to note that while these factors are distinct, they are interrelated and likely exacerbated each other’s effects on the population’s mental health.

3.2 Geographical and Meteorological

3.2.1 Population

3.2.1.1 Univariate Analysis

data$scaled_population <- (data$population - min(data$population))/(max(data$population)-min(data$population))
data$scaled_log_population <- (data$log_population - min(data$log_population)) / (max(data$log_population) - min(data$log_population))
library(gridExtra)

# Original Population
p1 <- ggplot(data, aes(population)) +
  geom_histogram(binwidth=1000, fill="skyblue", color="black") +
  labs(title = "Histogram of Population",
       x = "",
       y = "Count")

p2 <- ggplot(data, aes(x = "", y = population)) +
  geom_boxplot(fill="lightgreen", color="black") +
  labs(title = "Box Plot of Population",
       x = "",
       y = "")

# Scaled Population
p3 <- ggplot(data, aes(scaled_population)) +
  geom_histogram(binwidth=0.01, fill="skyblue", color="black") +
  labs(title = "Histogram of Scaled Population",
       x = "",
       y = "Count")

p4 <- ggplot(data, aes(x = "", y = scaled_population)) +
  geom_boxplot(fill="lightgreen", color="black") +
  labs(title = "Box Plot of Scaled Population",
       x = "",
       y = "")

# Scaled Log Population
p5 <- ggplot(data, aes(scaled_log_population)) +
  geom_histogram(binwidth=0.01, fill="skyblue", color="black") +
  labs(title = "Histogram of Scaled Log Population",
       x = "",
       y = "Count")

p6 <- ggplot(data, aes(x = "", y = scaled_log_population)) +
  geom_boxplot(fill="lightgreen", color="black") +
  labs(title = "Box Plot of Scaled Log Population",
       x = "",
       y = "")

# Arrange the plots in a grid
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 2)

From these visualizations, we can observe that the distribution of the population variable becomes significantly less skewed after applying both min-max scaling and a log transformation. The resulting scaled log population appears more normally distributed and shows fewer outliers compared to the original and scaled population. Therefore, using the ‘scaled_log_population’ variable in our analyses should yield more robust results.

3.2.1.2 Bivariate Analysis

In this section, we investigate the correlation between our candidate target variables and scaled_log_population. Since our suicide_ratio variable was derived from the ratio of suicide_no to population, our analysis focuses on the relationship between scaled_log_population and the three transformations of suicide_no: namely, suicide_no, log_suicide_no, and sqrt_suicide_no.

data_long <- data %>%
  select(scaled_log_population, suicides_no, log_suicide_no, sqrt_suicide_no) %>%
  pivot_longer(cols = -scaled_log_population, names_to = "variable", values_to = "value")

ggplot(data_long, aes(x = scaled_log_population, y = value, color= "skyblue")) +
  geom_point(size=0.2) +
  facet_wrap(~variable, scales = "free_y") +
  labs(x = "Scaled Log Population",
       y = "Value",
       title = "Scatterplot of Scaled Log Population vs. Target Variables") + 
  theme(legend.position = "none") 

From these scatterplots, it appears that the relationship between scaled_log_population and log_suicide_no is more linear compared to the other variables.

# List of variables to calculate correlations with scaled_log_population
vars <- c("suicides_no", "log_suicide_no", "sqrt_suicide_no")

# Calculate correlations
correlations <- purrr::map_dbl(vars, ~cor(data$scaled_log_population, data[[.]]))

# Print correlations
names(correlations) <- vars
print(correlations)
    suicides_no  log_suicide_no sqrt_suicide_no 
      0.3930255       0.8622564       0.6715714 

Is it surprising that there is a high correlation between the transformation of population and suicide_no? Not at all!

Our goal is to estimate the probability of an individual taking their own life, which we calculate by dividing the suicide_no by the population in each row. However, this means we cannot directly examine the relationship between suicide_ratio and population, as we use the population directly to calculate the ratio. Instead, we need to analyze the relationship between population and suicide_no for each row.

When the population of each row increases, the probability of a higher suicide_no also tends to increase. However, this comparison may not be entirely accurate. Should we disregard population altogether? Before making that decision, let’s address this question: Is there a relationship between a country’s population and suicide_ratio? In other words, is the probability of an individual taking their own life higher in countries with a larger population?

To explore this, we have stratified the population of each country into three categories (big, medium, and small) for each year. This allows us to examine whether there is a notable difference in suicide_ratio based on the size of the population.

Before moving forward, let’s create a dataframe that calculates the population of each country over the years.

pop_data <- data %>%
  group_by(year, country) %>%
  summarize(population = sum(population, na.rm = TRUE), .groups = "drop")

We aim to establish thresholds for each year to classify countries into four categories: very small, small, medium, and large. It is important to identify natural gaps in the distribution of population for each country within a given year. It should be noted that a median-based approach is not suitable since it would result in three categories of equal size. Additionally, we need to compare the population of each country with the populations of other countries within the same year, as a country may not be densely populated at present but could experience a significant increase in population in the future.

To accomplish this, we employ the Jenks natural breaks classification method for each year. This method aims to minimize the variance within classes while maximizing the variance between classes. It involves an iterative process that reallocates observations from one class to another until an optimal arrangement is achieved.

# Load required library
library(classInt)

# Define a function to apply Fisher-Jenks method for binning
fisher_jenks <- function(x) {
  bins <- classIntervals(x, n = 5, style = "fisher")$brks
  cut(x, breaks = bins, labels = c("Very_Small","Small", "Medium", "Large","Very_Large" ), include.lowest = TRUE)
}

# Add the new column to the data frame
pop_data <- pop_data %>%
  group_by(year) %>%
  mutate(population_bine_jenks = fisher_jenks(population))

# View the first few rows of the new data
head(pop_data)

Now, let’s examine the population categories we’ve created:

# Tabulate the categories
pop_distribution <- table(pop_data$population_bine_jenks)
print(pop_distribution)

Very_Small      Small     Medium      Large Very_Large 
      1714        249        202         85         31 

From the output, we notice that the category “Very Large” consists of only one country, the USA. To have more balanced categories, we’ll combine the USA with the “Large” category and rename accordingly:

# Adjust the categories
pop_data <- pop_data %>%
  mutate(population_bine_jenks = ifelse(population_bine_jenks == "Very_Large", "Large",
                                        as.character(population_bine_jenks)))

# Check the distribution after adjustment
adjusted_pop_distribution <- table(pop_data$population_bine_jenks)
print(adjusted_pop_distribution)

     Large     Medium      Small Very_Small 
       116        202        249       1714 
pop_data
# Join the population categories into our original data
data<- data %>%
  left_join(pop_data%>%
  select(country,year,population_bine_jenks),by = c("year", "country"))

We also bin the population based on the median and compare its performance with the Jenks natural breaks classification method in our model.

data_sum <- data %>%
  group_by(country, year) %>%
  summarise(population = sum(population))
`summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
thresholds <- quantile(data_sum$population, probs = c(0.25, 0.5, 0.75))

# Assign each country-year pair to a population category
data_binned <- data_sum %>%
  mutate(
    population_bine_median = case_when(
      population <= thresholds[1] ~ "Very_Small",
      population > thresholds[1] & population <= thresholds[2] ~ "Small",
      population > thresholds[2] & population <= thresholds[3] ~ "Medium",
      TRUE ~ "Large"
    )
  )

data <- data %>%
  left_join(data_binned%>%
  select(country,year,population_bine_median),by = c("year", "country"))

In this step, we will explore the relationship between log_suicide_ratio (chosen because it follows a normal distribution) and population_binde_jenks.

Let’s start by examining the descriptive statistics for both log_suicide_ratio and suicide_ratio.

data %>%
  group_by(population_bine_jenks) %>%
  summarise(
    mean_suicide_ratio = mean(log_suicide_ratio, na.rm = TRUE),
    median_suicide_ratio = median(log_suicide_ratio, na.rm = TRUE),
    min_suicide_ratio = min(log_suicide_ratio, na.rm = TRUE),
    max_suicide_ratio = max(log_suicide_ratio, na.rm = TRUE)
  )
data_grouped <- data %>%
  group_by(population_bine_jenks) %>%
  summarise(
    suicide_ratio = sum(suicides_no) / (sum(population) )
  )

ggplot(data_grouped, aes(x = population_bine_jenks, y = suicide_ratio, fill = population_bine_jenks)) +
  geom_bar(stat = "identity", color = "black") +
  labs(x = "Population Bin", y = "Suicide Ratio", title = "Suicide Ratio for each Population Bin") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none") +
  scale_fill_brewer(palette = "Set2")

The above plot suggests that the suicide ratio in Large countires is much higher. Now we use statistical tests to see if this difference is statisticaly siginifticant. Since we have more than two samples we can use ANOVA. It’s important to keep in mind that ANOVA makes several assumptions, including that the residuals are normally distributed, that the variances are equal across groups, and that the observations are independent. We should check these assumptions before interpreting the results.

# Fit the model
pop_anova <- aov(log_suicide_ratio ~ population_bine_jenks, data = data)

# Run the ANOVA
anova_result <- anova(pop_anova)

# Print the result
print(anova_result)
Analysis of Variance Table

Response: log_suicide_ratio
                         Df  Sum Sq Mean Sq F value    Pr(>F)    
population_bine_jenks     3   486.3  162.10  126.64 < 2.2e-16 ***
Residuals             22806 29191.4    1.28                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Testing Normality of Residuals Assumption for ANOVA:

# Create a data frame for residuals
residuals_df <- data.frame(residuals = residuals(pop_anova))

# Create histogram of residuals
hist_plot <- ggplot(residuals_df, aes(x = residuals)) +
  geom_histogram(fill = 'steelblue', color = 'black', bins = 30) +
  theme_minimal() +
  labs(x = "Residuals", y = "Frequency",
       title = "Histogram of Residuals")

# Create Q-Q plot of residuals
qq_plot <- ggplot(residuals_df, aes(sample = residuals)) +
  geom_qq(color = 'steelblue') +
  geom_qq_line(color = 'red') +
  theme_minimal() +
  labs(title = "Normal Q-Q Plot",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles")

# Arrange the plots side by side using the gridExtra package
library(gridExtra)
grid.arrange(hist_plot, qq_plot, ncol = 2)

Testing Homogeneity of Variances Assumption for ANOVA:

leveneTest(log_suicide_ratio ~ age, data = data)
Levene's Test for Homogeneity of Variance (center = median)
         Df F value    Pr(>F)    
group     4  55.038 < 2.2e-16 ***
      22805                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
bartlett.test(log_suicide_ratio ~ age, data = data)

    Bartlett test of homogeneity of variances

data:  log_suicide_ratio by age
Bartlett's K-squared = 250.28, df = 4, p-value < 2.2e-16

The small p-values in our tests lead us to reject the null hypothesis, which posits that the variances are equivalent across different groups. This implies that the results from an ANOVA test may not be reliable in our case. Therefore, we turn to the Kruskal-Wallis test.

The Kruskal-Wallis test is a non-parametric technique that assesses whether samples originate from the same distribution. It is suitable for comparing two or more independent samples of equal or differing sample sizes, effectively extending the Mann-Whitney U test, which is utilized for comparing only two groups.

Unlike the one-way ANOVA and t-tests, the Kruskal-Wallis test does not require the residuals to be normally distributed. Consequently, it can be used with continuous data that doesn’t follow a normal distribution. However, akin to ANOVA, it evaluates whether the mean ranks of the groups are different, not the means themselves.

The assumptions for the Kruskal-Wallis test are:

Independence: Observations within and between each sample should be independent, implying that one observation’s presence or absence doesn’t influence another observation’s presence or absence.

Ordinal Data: The dependent variable must be ordinal at a minimum, meaning that the observations can be ordered. It should be possible to say that one observation is greater than, equal to, or less than another observation.

Shape of Distribution: Although the Kruskal-Wallis test doesn’t require a specific data distribution like ANOVA does, it assumes that the shape of the distribution is identical for each group. While groups may have differing medians, their distribution’s overall shape should be the same.

For our data, we assume independence. Our ‘population_bin’ variable is ordinal, thereby fulfilling the second assumption. To verify the third assumption, we need to evaluate the distribution shapes via plots.

# Histogram
ggplot(data, aes(x =log_suicide_ratio)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~ population_bine_jenks)


# Density plot
ggplot(data, aes(x = log_suicide_ratio)) +
  geom_density() +
  facet_wrap(~ population_bine_jenks)


# Q-Q plot
ggplot(data, aes(sample = log_suicide_ratio)) +
  stat_qq() +
  facet_wrap(~ population_bine_jenks)

kruskal.test(log_suicide_ratio ~ population_bine_jenks, data = data)

    Kruskal-Wallis rank sum test

data:  log_suicide_ratio by population_bine_jenks
Kruskal-Wallis chi-squared = 347.32, df = 3, p-value < 2.2e-16

The result of the Kruskal-Wallis test aligns with that of the ANOVA. The obtained p-value is significantly small, indicating that the suicide rate differs significantly among different population categories.

#Factorizing our two new column population_bine_jenks and population_bine_median

data$population_bine_jenks <- factor(data$population_bine_jenks, 
                   ordered = T, 
                   levels = c("Very_Small",
                              "Small",
                              "Medium",
                              "Large"))
data$population_bine_median <- factor(data$population_bine_median, 
                   ordered = T, 
                   levels = c("Very_Small",
                              "Small",
                              "Medium",
                              "Large"))

3.2.2 Country and Continent

3.2.2.1 Univariate Analysis

# Count the number of observations for each country
country_counts <- data %>% 
  group_by(country) %>% 
  summarise(count = n()) %>% 
  arrange(desc(count))

# Create the bar plot
ggplot(country_counts, aes(x = reorder(country, count), y = count)) +
  geom_bar(stat = "identity", fill = "steelblue", color = "black") +
  labs(x = "Country", y = "Number of Rows", title = "Number of Rows for Each Country") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 15), 
        axis.title.y = element_text(size = 10),
        axis.title.x = element_text(size = 10),
        axis.text.x = element_blank()) # Remove country names

# Join the data to a map
map_data <- joinCountryData2Map(country_counts, joinCode = "NAME", nameJoinColumn = "country")
91 codes from your data successfully matched countries in the map
1 codes from your data failed to match with a country code in the map
152 codes from the map weren't represented in your data
# Set margins
par(mar = c(0, 0, 1, 0))

# Plot the map
mapCountryData(map_data, 
               nameColumnToPlot = "count", 
               mapTitle = "Number of Rows for each Country", 
               colourPalette = "blues",  # change color palette here
               oceanCol = "lightblue", 
               missingCountryCol = "grey65", 
               catMethod = "pretty")
You asked for 7 categories, 6 were used due to pretty() classification
Warning: colourPalette should be set to either a vector of colours or one of :white2Black black2White palette heat topo terrain rainbow negpos8 negpos9 
setting to heat colours as default

# Get the unique countries per continent in the data
continent_count <- data %>%
  group_by(continent) %>%
  summarise(num_countries_data = n_distinct(country))

# Actual country count per continent
actual_count <- data.frame(
  continent = c("Africa", "Asia", "Europe", "Americas", "Oceania"),
  num_countries_actual = c(54, 48, 44, 35, 14)  # replace with actual numbers
)

# Merge the two data frames
continent_count <- merge(continent_count, actual_count, by = "continent")

# Convert to long format for plotting
continent_count_long <- continent_count %>%
  pivot_longer(cols = -continent, 
               names_to = "category", 
               values_to = "count")

# Create a bar plot
# Then, create a bar plot
ggplot(continent_count_long, aes(x = continent, y = count, fill = category)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Continent", y = "Number of Countries",
       title = "Actual Countries vs Countries in Data per Continent") +
  scale_fill_discrete(name = "", labels = c("Number of Countries in Our Data", "Actual Number of Countries")) +
  theme_minimal()

The presented plots reveal that we have a limited number of samples from Africa, which should be taken into consideration when interpreting the results for this region. Additionally, it is important to note that our data for Asia is also relatively sparse, which may impact the robustness of any conclusions drawn for this continent.

3.2.2.2 Bivariate Analysis

country <- data %>%
  group_by(country, continent) %>%
  summarize(n = n(), 
            suicide_per_100k = (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000,
            .groups="drop") %>%
  arrange(desc(suicide_per_100k))

country$country <- factor(country$country, 
                          ordered = T, 
                          levels = rev(country$country))

ggplot(country, aes(x = country, y = suicide_per_100k, fill = continent)) + 
  geom_bar(stat = "identity") + 
  geom_hline(yintercept = global_average, linetype = 2, color = "grey35", size = 1) +
  labs(title = "Global suicides per 100k, by Country",
       x = "Country", 
       y = "Suicides per 100k", 
       fill = "Continent") +
  #coord_flip() +
  scale_y_continuous(breaks = seq(0, 45, 2)) + 
  theme(
  legend.position = "top",
  legend.key.size = unit(0.25, "cm"),
  plot.title = element_text(hjust = 0.5),
  axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 1, size= 4),
  #axis.line.x = element_line(inherit.blank = TRUE)
  ) 
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
Please use `linewidth` instead.

Lithuania’s rate has been highest by a large margin: > 44 suicides per 100k (per year)

country <- data %>%
  group_by(country) %>%
  summarize(suicide_per_100k = (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000,
            .groups = "drop")

countrydata <- joinCountryData2Map(country, joinCode = "NAME", nameJoinColumn = "country")
91 codes from your data successfully matched countries in the map
1 codes from your data failed to match with a country code in the map
152 codes from the map weren't represented in your data
par(mar=c(0, 0, 2, 0)) # margins

mapCountryData(countrydata, 
nameColumnToPlot="suicide_per_100k", 
mapTitle="Sucide per 100k across the Globe", 
colourPalette = "heat", 
oceanCol="lightblue", 
missingCountryCol="grey65", 
catMethod = "pretty")
You asked for 7 categories, 10 were used due to pretty() classification

mapCountryData(countrydata, 
nameColumnToPlot="suicide_per_100k", 
mapTitle="Suicides per 100k in Eurasia", 
mapRegion = "eurasia", 
colourPalette = "heat", 
oceanCol="lightblue", 
missingCountryCol="grey65", 
addLegend = FALSE, 
catMethod = "pretty")
You asked for 7 categories, 10 were used due to pretty() classification

It’s essential to be aware of our data’s limitations. Specifically, we’re missing a significant amount of information for Africa and Asia. On top of that, eight countries were excluded due to lack of sufficient data.

Therefore, our analyses, whether on a global or continent level, might not provide a fully accurate picture. We’re essentially trying to make sense of a puzzle with missing pieces.

Lastly, when comparing suicide rates between different countries, it’s crucial to consider that what is recorded as a suicide can vary by country. The reliability of suicide reporting can also influence our results.

So, even though our analysis can help identify some trends, we must keep these caveats in mind when interpreting our findings.

Due to the limited availability of data from Africa, we have excluded this continent from the current analysis.


data_continent <- data %>% filter(continent != "Africa")

# Function to calculate suicide rate per 100k
calculate_suicide_rate <- function(suicides_no, population) {
  (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000
}

# Calculating suicide rates by continent
continent <-data_continent %>%
  group_by(continent) %>%
  summarize(suicide_per_100k = calculate_suicide_rate(suicides_no, population)) %>%
  arrange(suicide_per_100k)


continent_plot <- ggplot(continent, aes(x = continent, y = suicide_per_100k, fill = continent)) + 
  geom_bar(stat = "identity") + 
  labs(title = "Suicide Rates per 100k by Continent",
       x = "Continent", 
       y = "Suicides per 100k", 
       fill = "Continent") +
  theme(legend.position = "none") + 
  scale_y_continuous(breaks = seq(0, 20, 1), minor_breaks = F)
print(continent_plot)

Our preliminary analysis suggests that Europe appears to have a higher suicide rate compared to other continents. To further substantiate this observation, we should validate it with rigorous statistical tests. After all, our dataset is merely a sample and doesn’t necessarily represent the whole population.

Hence, we need to investigate if the observed differences are statistically significant.

Our null hypothesis (H0) states that the mean suicide rates are identical across all continents over the 30-year span.

On the other hand, our alternative hypothesis (H1) asserts that at least one continent has a distinct mean suicide rate compared to the others.

We set our significance level at 0.05 for these tests.

For comparison, we’ll initially employ the ANOVA test. If our data does not satisfy the assumptions required for ANOVA, we’ll resort to the non-parametric Kruskal-Wallis test.

continent_anova <- aov(log_suicide_ratio ~ continent, data = data_continent)

# Run the ANOVA
anova_result <- anova(continent_anova)

# Print the result
print(anova_result)
Analysis of Variance Table

Response: log_suicide_ratio
             Df  Sum Sq Mean Sq F value    Pr(>F)    
continent     3  1098.9  366.30  302.69 < 2.2e-16 ***
Residuals 22116 26763.3    1.21                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
kruskal.test(log_suicide_ratio ~ continent, data = data_continent)

    Kruskal-Wallis rank sum test

data:  log_suicide_ratio by continent
Kruskal-Wallis chi-squared = 801.43, df = 3, p-value < 2.2e-16

The p-values derived from both tests are significantly small, leading us to reject the null hypothesis. Thus, we can infer that at least one continent has a distinctive suicide rate. However, the ANOVA test does not provide insights about which specific continent diverges, nor does it indicate the extent of this difference.

To delve into these specifics, we’ll utilize the Tukey’s Honest Significant Difference (HSD) post-hoc test. This test will facilitate the identification of groups with significantly different means, providing a comprehensive understanding of our data.

TukeyHSD(continent_anova)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = log_suicide_ratio ~ continent, data = data_continent)

$continent
                        diff         lwr         upr     p adj
Asia-Americas    -0.14199557 -0.19592998 -0.08806116 0.0000000
Europe-Americas   0.37860342  0.33504186  0.42216498 0.0000000
Oceania-Americas  0.44550890  0.34106451  0.54995330 0.0000000
Europe-Asia       0.52059899  0.46851548  0.57268251 0.0000000
Oceania-Asia      0.58750447  0.47922837  0.69578058 0.0000000
Oceania-Europe    0.06690548 -0.03659526  0.17040622 0.3446553

Our initial hypothesis stands correct as the data reveals that Europe exhibits a higher suicide rate compared to other regions. Now, let’s explore if different regions within Europe exhibit distinct patterns in terms of suicide rates.

# Define a list of countries for each region
data_europe <- data %>%
  filter(continent == "Europe")

northern <- c("Denmark", "Estonia", "Finland", "Iceland", "Ireland", "Latvia", "Lithuania", "Norway", "Sweden", "United Kingdom","Luxembourg")
southern <- c("Greece", "Italy", "Portugal", "Spain","Malta")
eastern <- c("Bulgaria", "Czech Republic", "Hungary", "Poland", "Romania", "Russian Federation", "Slovakia","Albania","Ukraine","Belarus","Montenegro","Croatia", "Serbia", "Slovenia")
western <- c("Austria", "Belgium", "France", "Germany", "Netherlands", "Switzerland")


# Add a new column 'region' based on the country
data_europe$region <- case_when(
  data_europe$country %in% northern ~ "Northern",
  data_europe$country %in% southern ~ "Southern",
  data_europe$country %in% eastern ~ "Eastern",
  data_europe$country %in% western ~ "Western",
  TRUE ~ NA_character_ # for countries not listed above
)
# Group the data by region and calculate the average suicide rate per 100k and the number of unique countries
region_data <- data_europe %>%
  group_by(region) %>%
  summarise(avg_suicide_rate_per100k = sum(suicides_no, na.rm = TRUE) / sum(population) * 1e5,
            num_countries = n_distinct(country)) 

# Create the bar plot
ggplot(region_data, aes(x = reorder(region, -avg_suicide_rate_per100k), y = avg_suicide_rate_per100k, fill = region)) +
  geom_col() +
  scale_fill_brewer(palette = "Spectral") +
  labs(x = "Region", y = "Average suicide rate per 100k", 
       title = "Average Suicide Rates per 100k for Regions in Europe") +
  theme_minimal() +
  theme(legend.position = "none")

The presented plot indicates that the suicide ratio in Eastern Europe is notably higher compared to other regions. This finding aligns with the observations made in the previous sections regarding the countries within this region.

3.2.2.3 Multivariate Analyisis

Our goal is to understand the temporal trends in suicide rates for each country. Rather than creating visualizations for all 93 countries, we adopt a more streamlined approach by fitting a linear regression model to the data for each country. This allows us to identify patterns of increase or decrease in suicide rates over time.

Specifically, we’re interested in the ‘year’ coefficient in our linear models. This coefficient signifies the rate of change in suicide rates over time. To control for multiple comparisons, we only consider those countries with a corrected p-value less than 0.05.

To summarize, we are identifying countries where there’s a statistically significant linear trend in suicide rates over time. These trends are then rank-ordered based on their rate of change, providing a clear picture of where suicide rates are increasing or decreasing most rapidly.

# Create a summary data frame, grouping by country and year
country_year <- data %>%
  group_by(country, year) %>%
  summarize(suicides = sum(suicides_no), 
            population = sum(population), 
            suicide_per_100k = (suicides / population) * 100000, 
            gdp_per_capita = mean(GDP_per_capita),
            .groups = "drop")  # Prevents the warning about groups

# Fit a linear model for each country, tidy the output, and filter for significant trends
country_year_trends <- country_year %>%
  nest(data = -country) %>%  # Use explicit naming to prevent warning
  mutate(model = map(data, ~ lm(suicide_per_100k ~ year, data = .)),
         tidied = map(model, broom::tidy)) %>%
  unnest(cols = c(tidied))

# Adjust p-values and filter for significant results
country_year_sig_trends <- country_year_trends %>%
  filter(term == "year") %>%
  mutate(p.adjusted = p.adjust(p.value, method = "holm")) %>%
  filter(p.adjusted < .05) %>%
  arrange(estimate)

# Make country an ordered factor
country_year_sig_trends <- mutate(country_year_sig_trends, country = factor(country, ordered = TRUE, levels = country))
ggplot(country_year_sig_trends, aes(x=country, y=estimate, col = estimate)) + 
  geom_point(stat='identity', size = 2) +
  geom_hline(yintercept = 0, col = "grey", size = 1) +
  scale_color_gradient(low = "green", high = "red") +
  geom_segment(aes(y = 0, 
                   x = country, 
                   yend = estimate, 
                   xend = country), size = 1) +
  labs(title="Change per year (Suicides per 100k)", 
       x = "Country", y = "Change Per Year (Suicides per 100k)") +
  scale_y_continuous(breaks = seq(-2, 2, 0.2), limits = c(-1.5, 1.5)) +
  theme(legend.position = "none",
        axis.text.y = element_text(size= 5)) +
  coord_flip()

Approximately half of the countries (48 out of 96) exhibit a linear change in suicide rates as time progresses. Among these 48 countries, 32 of them (about two-thirds) show a decreasing trend. Overall, this trend presents a positive picture. However, it is worth noting that the suicide rates in Guyana and Korea are a cause for concern as they display concerning patterns.

### Lets look at those countries with the steepest increasing trends

top12_increasing <- tail(country_year_sig_trends$country, 12)

country_year %>%
  filter(country %in% top12_increasing) %>%
  ggplot(aes(x = year, y = suicide_per_100k, col = country)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  facet_wrap(~ country) + 
  theme(legend.position = "none") + 
  labs(title="12 Steepest Increasing Trends", 
       subtitle="Of countries with significant trends (p < 0.05)", 
       x = "Year", 
       y = "Suicides per 100k")

The historical data for Guyana raises concerns due to a seemingly improbable jump in the suicide rate. While Guyana is known for having high suicide rates, the sudden increase observed appears questionable. It is possible that changes in how suicide cases were classified or reported could have influenced this significant surge in the dat

continent_time <- data_continent %>%
  group_by(year, continent) %>%
  summarize(suicide_per_100k = (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000, .groups="drop")

continent_time$continent <- factor(continent_time$continent, ordered = T, levels = continent$continent)

continent_time_plot <- ggplot(continent_time, aes(x = year, y = suicide_per_100k, col = factor(continent))) + 
  facet_grid(continent ~ ., scales = "free_y") + 
  geom_line() + 
  geom_point() + 
  labs(title = "Trends Over Time, by Continent", 
       x = "Year", 
       y = "Suicides per 100k", 
       color = "Continent") + 
  theme(legend.position = "none", title = element_text(size = 10)) + 
  scale_x_continuous(breaks = seq(1985, 2015, 5), minor_breaks = F)
print(continent_time_plot)

  • Europe, having the highest suicide rate overall, has experienced a consistent decline of approximately 40% since 1995.
  • By 2015, Europe’s suicide rate had converged with that of Asia and Oceania, showing similar levels.
  • In contrast to the global downward trend, Oceania and the Americas demonstrate an upward trajectory in suicide rates. This distinct pattern is concerning and calls for a thorough investigation into the underlying factors contributing to this rise, as well as the implementation of effective interventions.

3.2.3 Tempurate

We have three variables: avg_temp, max_temp, and min_temp. These variables represent the average, maximum, and minimum temperatures, respectively, for each country in each year.

3.2.3.1 Univariate Analysis

library(gridExtra)
# Function to create a list of plots for each variable
create_plots <- function(variable_name, data) {
  p1 <- ggplot(data, aes_string(variable_name)) +
    geom_boxplot() 

  p2 <- ggplot(data, aes_string(variable_name)) +
    geom_histogram(bins = 30, fill = "steelblue", color = "white") 

  p3 <- ggplot(data, aes_string(variable_name)) +
    geom_density(fill = "steelblue")

  list(p1, p2, p3)
}

# Create plots for each temperature variable
avg_temp_plots <- create_plots("avg_temp", data)
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
Please use tidy evaluation idioms with `aes()`. 
See also `vignette("ggplot2-in-packages")` for more information.
min_temp_plots <- create_plots("min_temp", data)
max_temp_plots <- create_plots("max_temp", data)

# Arrange the plots in a grid
grid.arrange(grobs = c(avg_temp_plots, min_temp_plots, max_temp_plots), ncol = 3)

3.2.3.2 Bivariate Analysis

Before calculating the correlation between temperature variables and suicide ratio, it is important to remove extreme outliers from the dataset.

library(gridExtra)
library(knitr)

# Function to process each temperature variable
correlation_plots <- function(temp_var, data) {
  
  # Calculate and print the correlation before removing outliers
  corr_before <- cor(data[[temp_var]], data$suicide_ratio, use = "complete.obs")
  
  # Calculate quartiles and IQR
  Q1 <- quantile(data[[temp_var]], 0.25, na.rm = TRUE)
  Q3 <- quantile(data[[temp_var]], 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  
  # Identify outliers
  outliers <- data[[temp_var]] < (Q1 - 1.5 * IQR) | data[[temp_var]] > (Q3 + 1.5 * IQR)
  
  # Remove outliers
  data_no_outliers <- data[!outliers, ]
  
  # Calculate and print the correlation after removing outliers
  corr_after <- cor(data_no_outliers[[temp_var]], data_no_outliers$suicide_ratio, use = "complete.obs")
  
  # Create scatter plot with a regression line
  plot <- ggplot(data_no_outliers, aes_string(x = temp_var, y = "suicide_ratio")) +
    geom_point(color="skyblue", size=0.5) +
    geom_smooth(method = lm, color = "pink") +
    labs(x = paste("Temperature (", temp_var, ")", sep = ""), y = "Suicide Ratio", 
         title = temp_var) +
    theme_minimal()
  
  return(list(corr_before = corr_before, corr_after = corr_after, plot = plot))
}

# Call the function for each temperature variable
result_avg_temp <- correlation_plots("avg_temp", data)
result_min_temp <- correlation_plots("min_temp", data)
result_max_temp <- correlation_plots("max_temp", data)

# Combine the plots into a grid
grid.arrange(result_avg_temp$plot, result_min_temp$plot, result_max_temp$plot, nrow=1, ncol=3)

# Combine the correlation results into a data frame and display as a table
correlation_results <- data.frame(
  Temperature_Variable = c("avg_temp", "min_temp", "max_temp"),
  Correlation_Before = c(result_avg_temp$corr_before, result_min_temp$corr_before, result_max_temp$corr_before),
  Correlation_After = c(result_avg_temp$corr_after, result_min_temp$corr_after, result_max_temp$corr_after)
)

print(correlation_results)
cor(data$min_temp,data$sqrt_suicide_ratio)
[1] -0.353432
cor(data$min_temp,data$log_suicide_ratio)
[1] -0.1532334

It appears that the temperature variables exhibit a stronger correlation with the square root of the suicide ratio (sqrt_suicide_ratio). This observation is noteworthy and should be taken into account during further analysis.

Now, similar to the population variable, let’s bin the temperature variables. It is important to note that when using the Jenks method, we classify the temperatures for each year separately. This approach helps reduce the influence of temperature variations over the years and provides a more accurate assessment of the temperature categories.

temp_data <- data %>%
  group_by(year, country) %>%
  summarize(temp = mean(avg_temp), .groups = "drop")
fisher_jenks <- function(x) {
  bins <- classIntervals(x, n = 5, style = "fisher")$brks
  cut(x, breaks = bins, labels = c("Freezing","Very_Cold", "Cold", "Warm","Hot" ), include.lowest = TRUE)
}

# Add the new column to the data frame
temp_data <- temp_data %>%
  group_by(year) %>%
  mutate(avg_temp_bine_jenks = fisher_jenks(temp))

# View the new data
head(temp_data)
temp_data
table(temp_data$avg_temp_bine_jenks)

 Freezing Very_Cold      Cold      Warm       Hot 
       77       301       697       375       831 

Considering the limited number of countries falling under the “Freezing” temperature category, it would be appropriate to combine the “Very Cold” and “Freezing” categories into a single category. This consolidation ensures that the temperature classification remains meaningful and representative despite the scarcity of data points in the “Freezing” category

temp_data <- temp_data %>%
  mutate(avg_temp_bine_jenks = ifelse(avg_temp_bine_jenks == "Freezing", "Very_Cold",
                                      as.character(avg_temp_bine_jenks)))
table(temp_data$avg_temp_bine_jenks)

     Cold       Hot Very_Cold      Warm 
      697       831       378       375 
# lets join this dataset to original dataset 
data<- data %>%
  left_join(temp_data%>%
  select(country,year,avg_temp_bine_jenks),by = c("year", "country"))

Now, let’s compare the suicide ratio in each climate category.

data_grouped <- data %>%
  group_by(avg_temp_bine_jenks) %>%
  summarise(
  suicide_ratio = mean(suicide_ratio) )
  
ggplot(data_grouped, aes(x = avg_temp_bine_jenks, y = suicide_ratio, fill = avg_temp_bine_jenks)) +
  geom_bar(stat = "identity", color = "black") +
  labs(x = "Average temp Bin", y = "Suicide Ratio", title = "Suicide Ratio for each Climate Category") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none") +
  scale_fill_brewer(palette = "Set2")

From the plots, it is evident that countries with hot and warm climates exhibit significantly lower suicide rates. Moreover, the plot suggests that we can simplify the climate categories into just two groups: “Warm” and “Not Warm” since the suicide rates in these two categories are similar.

data <- data %>%
  mutate(avg_temp_bine_jenks = ifelse(avg_temp_bine_jenks == "Hot", "Warm",
                                      as.character(avg_temp_bine_jenks)))
data <- data %>%
  mutate(avg_temp_bine_jenks = ifelse(avg_temp_bine_jenks == "Very_Cold", "Cold",
                                      as.character(avg_temp_bine_jenks)))

We can apply the same procedure of analysis to the variables of maximum temperature and mean temperature as well.

# MIN TEMP
temp_data <- data %>%
  group_by(year, country) %>%
  summarize(temp = mean(min_temp), .groups = "drop")

fisher_jenks <- function(x) {
  bins <- classIntervals(x, n = 5, style = "fisher")$brks
  cut(x, breaks = bins, labels = c("Freezing","Very_Cold", "Cold", "Warm","Hot" ), include.lowest = TRUE)
}

# Add the new column to the data frame
temp_data <- temp_data %>%
  group_by(year) %>%
  mutate(min_temp_bine_jenks = fisher_jenks(temp))

# since we have very low country for very_cold we unify very cold and cold also 
temp_data <- temp_data %>%
  mutate(min_temp_bine_jenks = ifelse(min_temp_bine_jenks == "Freezing", "Very_Cold",
                                      as.character(min_temp_bine_jenks)))

# Join to the data
data<- data %>%
  left_join(temp_data%>%
  select(country,year,min_temp_bine_jenks),by = c("year", "country"))
data_grouped <- data %>%
  group_by(min_temp_bine_jenks) %>%
  summarise(
  suicide_ratio = mean(suicide_ratio) )

ggplot(data_grouped, aes(x = min_temp_bine_jenks, y = suicide_ratio, fill = min_temp_bine_jenks)) +
  geom_bar(stat = "identity", color = "black") +
  labs(x = "min temp Bin", y = "Suicide Ratio", title = "Suicide Ratio for each min temp Bin") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none") +
  scale_fill_brewer(palette = "Set2")

data <- data %>%
  mutate(min_temp_bine_jenks = ifelse(min_temp_bine_jenks == "Warm", "Hot", as.character(min_temp_bine_jenks)))
# Max Temp

temp_data <- data %>%
  group_by(year, country) %>%
  summarize(temp = mean(max_temp))
`summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
fisher_jenks <- function(x) {
  bins <- classIntervals(x, n = 5, style = "fisher")$brks
  cut(x, breaks = bins, labels = c("Freezing","Very_Cold", "Cold", "Warm","Hot" ), include.lowest = TRUE)
}

# Add the new column to the data frame
temp_data <- temp_data %>%
  group_by(year) %>%
  mutate(max_temp_bine_jenks = fisher_jenks(temp))

temp_data <- temp_data %>%
  mutate(max_temp_bine_jenks = ifelse(max_temp_bine_jenks == "Freezing", "Very_Cold",
                                      as.character(max_temp_bine_jenks)))

data<- data %>%
  left_join(temp_data%>%
  select(country,year,max_temp_bine_jenks),by = c("year", "country"))
data_grouped <- data %>%
  group_by(max_temp_bine_jenks) %>%
  summarise(
  suicide_ratio = mean(suicide_ratio) )

ggplot(data_grouped, aes(x = max_temp_bine_jenks, y = suicide_ratio, fill = max_temp_bine_jenks)) +
  geom_bar(stat = "identity", color = "black") +
  labs(x = "max temp Bin", y = "Suicide Ratio", title = "Suicide Ratio for each max temp Bin") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none") +
  scale_fill_brewer(palette = "Set2")

NA
NA
data <- data %>%
  mutate(max_temp_bine_jenks = ifelse(max_temp_bine_jenks == "Hot", "Warm", as.character(max_temp_bine_jenks)))
data <- data %>%
  mutate(max_temp_bine_jenks = ifelse(max_temp_bine_jenks == "Very_Cold", "Cold", as.character(max_temp_bine_jenks)))

3.3 Social and Economical

3.3.1 GDP

3.3.1.1 Univariate Analysis

# Scale the GDP and log GDP variables
data$scaled_GDP_for_year <- (data$GDP_for_year - min(data$GDP_for_year))/(max(data$GDP_for_year)-min(data$GDP_for_year))
data$scaled_log_GDP_year <- (data$log_GDP_year - min(data$log_GDP_year)) / (max(data$log_GDP_year) - min(data$log_GDP_year))
library(gridExtra)
# Plotting original GDP
p1 <- ggplot(data, aes(GDP_for_year)) +
  geom_histogram(binwidth=1000000000, fill="skyblue", color="black") +
  labs(title = "Histogram of GDP_for_year",
       x = "GDP_for_year",
       y = "Count")

p2 <- ggplot(data, aes(x = "", y = GDP_for_year)) +
  geom_boxplot(fill="lightgreen", color="black") +
  labs(title = "Box Plot of GDP_for_year",
       x = "",
       y = "GDP_for_year")

# Plotting scaled GDP
p3 <- ggplot(data, aes(scaled_GDP_for_year)) +
  geom_histogram(binwidth=0.01, fill="skyblue", color="black") +
  labs(title = "Histogram of scaled_GDP_for_year",
       x = "scaled_GDP_for_year",
       y = "Count")

p4 <- ggplot(data, aes(x = "", y = scaled_GDP_for_year)) +
  geom_boxplot(fill="lightgreen", color="black") +
  labs(title = "Box Plot of scaled_GDP_for_year",
       x = "",
       y = "scaled_GDP_for_year")

# Plotting scaled log GDP
p5 <- ggplot(data, aes(scaled_log_GDP_year)) +
  geom_histogram(binwidth=0.01, fill="skyblue", color="black") +
  labs(title = "Histogram of scaled_log_GDP_year",
       x = "scaled_log_GDP_year",
       y = "Count")

p6 <- ggplot(data, aes(x = "", y = scaled_log_GDP_year)) +
  geom_boxplot(fill="lightgreen", color="black") +
  labs(title = "Box Plot of scaled_log_GDP_year",
       x = "",
       y = "scaled_log_GDP_year")

# Arrange the plots in a grid
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 2)

# Computing scaled and log-scaled GDP per capita
data$scaled_GDP_per_capita <- (data$GDP_per_capita - min(data$GDP_per_capita)) / 
                              (max(data$GDP_per_capita) - min(data$GDP_per_capita))

data$scaled_log_GDP_capita <- (data$log_GDP_capita-min(data$log_GDP_capita))/
                              (max(data$log_GDP_capita)-min(data$log_GDP_capita)) 
# Loading required library
library(gridExtra)

# Original GDP per Capita
p1 <- ggplot(data, aes(GDP_per_capita)) +
  geom_histogram(binwidth=1000, fill="skyblue", color="black") +
  labs(title = "Histogram of GDP_per_Capita",
       x = "GDP_per_Capita",
       y = "Count")

p2 <- ggplot(data, aes(x = "", y = GDP_per_capita)) +
  geom_boxplot(fill="lightgreen", color="black") +
  labs(title = "Box Plot of GDP_per_Capita",
       x = "",
       y = "GDP_per_Capita")

# Scaled GDP per Capita
p3 <- ggplot(data, aes(scaled_GDP_per_capita)) +
  geom_histogram(binwidth=0.01, fill="skyblue", color="black") +
  labs(title = "Histogram of Scaled GDP_per_Capita",
       x = "Scaled GDP_per_Capita",
       y = "Count")

p4 <- ggplot(data, aes(x = "", y = scaled_GDP_per_capita)) +
  geom_boxplot(fill="lightgreen", color="black") +
  labs(title = "Box Plot of Scaled GDP_per_Capita",
       x = "",
       y = "Scaled GDP_per_Capita")

# Scaled Log GDP per Capita
p5 <- ggplot(data, aes(scaled_log_GDP_capita)) +
  geom_histogram(binwidth=0.01, fill="skyblue", color="black") +
  labs(title = "Histogram of Scaled Log GDP_per_Capita",
       x = "Scaled Log GDP_per_Capita",
       y = "Count")

p6 <- ggplot(data, aes(x = "", y = scaled_log_GDP_capita)) +
  geom_boxplot(fill="lightgreen", color="black") +
  labs(title = "Box Plot of Scaled Log GDP_per_Capita",
       x = "",
       y = "Scaled Log GDP_per_Capita")

# Arrange the plots in a grid
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 2)

3.3.1.2 Bivariate Analysis

# Prepare the data in long format
long_data <- data %>% 
  select(scaled_log_GDP_year, suicide_ratio, log_suicide_ratio, sqrt_suicide_ratio) %>%
  gather(key = "Variable", value = "Value", -scaled_log_GDP_year)

# Create the plot
ggplot(long_data, aes(x = scaled_log_GDP_year, y = Value)) +
  geom_point(color = "skyblue", size=0.5) +
  facet_wrap(~Variable, scales = "free", ncol = 3) +
  ggtitle("Scatterplots of scaled_log_GDP_year vs. different variables") +
  theme_minimal() +
  ylab("") +
  xlab("scaled_log_GDP_year")

cor(data$scaled_log_GDP_year, data$suicide_ratio)
[1] 0.1049844
cor(data$scaled_log_GDP_year, data$log_suicide_ratio)
[1] -0.06932113
cor(data$scaled_log_GDP_year, data$sqrt_suicide_ratio)
[1] 0.2291764
# Prepare the data in long format
long_data <- data %>% 
  select(scaled_log_GDP_capita, suicide_ratio, log_suicide_ratio, sqrt_suicide_ratio) %>%
  gather(key = "Variable", value = "Value", -scaled_log_GDP_capita)

# Create the plot
ggplot(long_data, aes(x = scaled_log_GDP_capita, y = Value)) +
  geom_point(color = "skyblue", size=0.5) +
  facet_wrap(~Variable, scales = "free", ncol = 3) +
  ggtitle("Scatterplots of scaled_log_GDP_year vs. different variables") +
  theme_minimal() +
  ylab("") +
  xlab("scaled_log_GDP_year")

cor(data$scaled_log_GDP_capita, data$suicide_ratio)
[1] 0.01539712
cor(data$scaled_log_GDP_capita, data$log_suicide_no)
[1] 0.1347893
cor(data$scaled_log_GDP_capita, data$sqrt_suicide_ratio)
[1] 0.06002621
cor(data$GDP_per_capita, data$suicide_ratio)
[1] 0.002272913
cor(data$GDP_per_capita, data$log_suicide_no)
[1] 0.1111765
cor(data$GDP_per_capita, data$sqrt_suicide_ratio)
[1] 0.05862302

I am surprised to discover that there seems to be no evident influence between GDP and suicide rates. Let’s further investigate this matter.

Considering the fact that GDP tends to increase over the years for countries, it becomes clear that GDP alone may not be a reliable indicator of a country’s overall wealth or prosperity.

To gain deeper insights, we can introduce a new column called “gpd_pro_cap,” which represents the share of each individual within each cluster divided by the sum of the GDP values for all countries. This calculation provides an estimation of the share of each individual within each cluster relative to the total GDP of the world during those respective years.By incorporating this new measure, we aim to capture the average share of each individual within each cluster in each year, accounting for the global GDP. This approach allows us to evaluate the relative economic position of individuals within their respective clusters over time.

Let’s analyze the correlation between the average GDP per capita and the years.

# Compute average GDP_per_capita for each year
df_yearly <- data %>%
  group_by(year) %>%
  summarise(avg_GDP_per_capita = mean(GDP_per_capita, na.rm = TRUE))

correlation <- cor(df_yearly$year, df_yearly$avg_GDP_per_capita)
print(correlation)
[1] 0.9450579

To address this issue, we can categorize GDP per capita similar to how we categorized the population variable earlier.

gdp_data <- data %>%
  group_by(year, country) %>%
  summarize(income = mean(GDP_per_capita, na.rm = TRUE), .groups = "drop")
fisher_jenks <- function(x) {
  bins <- classIntervals(x, n = 4, style = "fisher")$brks
  cut(x, breaks = bins, labels = c("Very_Low_income","Low_income", "Medium_income", "High_income" ), include.lowest = TRUE)
}

# Add the new column to the data frame
gdp_data <- gdp_data %>%
  group_by(year) %>%
  mutate(gdp_per_capita_bine_jenks = fisher_jenks(income))

# View the new data
table(gdp_data$gdp_per_capita_bine_jenks)

Very_Low_income      Low_income   Medium_income     High_income 
           1259             420             457             145 
data<- data %>%
  left_join(gdp_data%>%
  select(country,year,gdp_per_capita_bine_jenks),by = c("year", "country"))
data_sum <- data %>%
  group_by(country, year) %>%
  summarise(GDP_per_capita = mean(GDP_per_capita), .groups = "drop")

thresholds <- quantile(data_sum$GDP_per_capita, probs = c(0.25, 0.5, 0.75))

# Assign each country-year pair to a population category
data_binned <- data_sum %>%
  mutate(
    gdp_bine_median = case_when(
      GDP_per_capita <= thresholds[1] ~ "Very_low_income",
      GDP_per_capita > thresholds[1] & GDP_per_capita <= thresholds[2] ~ "low_income",
      GDP_per_capita > thresholds[2] & GDP_per_capita <= thresholds[3] ~ "Medium_income",
      TRUE ~ "high_income"
    )
  )

data <- data %>%
  left_join(data_binned%>%
  select(country,year,gdp_bine_median),by = c("year", "country"))

By using the GDP binning method (gdp_bine_jenks) for each year, we mitigate the impact of the increasing GDP over time. Now, let’s examine the suicide ratio within each GDP category to gain further insights.

# Calculate the mean suicide_ratio for each group
mean_suicide_ratio <- data %>%
  group_by(gdp_per_capita_bine_jenks) %>%
  summarise(mean_suicide_ratio = mean(suicide_ratio))

# Print the mean_suicide_ratio dataframe
print(mean_suicide_ratio)
# Plot the mean_suicide_ratio
ggplot(mean_suicide_ratio, aes(x = gdp_per_capita_bine_jenks, y = mean_suicide_ratio, fill = gdp_per_capita_bine_jenks)) +
  geom_col(show.legend = FALSE) + # Remove the color legend
  scale_fill_brewer(palette = "Set2") + # Change the color palette
  labs(title = "Mean Suicide Ratio by GDP Group",
       x = "GDP Group (Jenks Natural Breaks)",
       y = "Mean Suicide Ratio") +
  theme_minimal() + # Use a clean theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability

It appears that countries with lower GDP per capita tend to exhibit lower suicide rates. This observation suggests a possible inverse relationship between the economic status of a country and its suicide rate.

3.3.1.3 Multivariate Analysis

In this section, we aim to explore the relationship between a country’s economic prosperity and its suicide rate. Specifically, we investigate the question: “As a country gets richer, does its suicide rate decrease?”

It depends on the country - for almost every country, there is a high correlation between year and gdp per capita, i.e. as time goes on, gdp per capita linearly increases.

country_year_gdp <- data %>%
  group_by(country, year) %>%
  summarize(GDP_per_capita = mean(GDP_per_capita), .groups = "drop")
  
country_year_gdp_corr <- country_year_gdp %>%
  ungroup() %>%
  group_by(country) %>%
  summarize(year_gdp_correlation = cor(year, GDP_per_capita), .groups = "drop")

In our analysis, we examined the relationship between ‘year’ and ‘GDP per capita’ within individual countries by calculating the Pearson correlations. The results were intriguing: the mean correlation was 0.878, indicating a very strong positive linear relationship. Essentially, this suggests that an increase in wealth per person within a country is correlated with an increase in the country’s suicide rate over time.

However, it’s crucial to note that these trends are not uniform across all countries. While some countries show an increase in suicide rates over time, most are actually experiencing a decrease.

This leads us to ask a slightly different but equally significant question: Do wealthier countries have higher suicide rates? To explore this, we calculated the mean GDP per capita across all available years for each country, then compared this with the average suicide rate over the same period. This approach provides us with a single data point for each country, offering a general impression of a nation’s affluence and its suicide rate.

country_mean_gdp <- data %>%
  group_by(country, continent) %>%
  summarize(suicide_per_100k = (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000, 
            gdp_per_capita = mean(GDP_per_capita), .groups = "drop")

ggplot(country_mean_gdp, aes(x = gdp_per_capita, y = suicide_per_100k, col = continent)) + 
  geom_point() + 
  scale_x_continuous(labels=scales::dollar_format(prefix="$"), breaks = seq(0, 70000, 10000)) + 
  labs(title = "Correlation between GDP (per capita) and Suicides per 100k", 
       subtitle = "Plot containing every country",
       x = "GDP (per capita)", 
       y = "Suicides per 100k", 
       col = "Continent") 

A number of countries in our dataset exhibit high leverage and residuals, potentially influencing the fit of our regression line. A notable example is Lithuania, which is situated in the top left of our graph. To mitigate this impact, we’ll apply Cook’s Distance as a measure to identify and exclude outliers. We will exclude those countries with a Cook’s Distance value greater than 4/n, which is a common threshold.

After implementing this adjustment, we’ll examine the revised model, now free of outliers, to better understand its statistical properties.

model1 <- lm(suicide_per_100k ~ gdp_per_capita, data = country_mean_gdp)

gdp_suicide_no_outliers <- model1 %>%
  augment() %>%
  arrange(desc(.cooksd)) %>%
  filter(.cooksd < 4/nrow(.)) %>% # removes 5/93 countries
  inner_join(country_mean_gdp, by = c("suicide_per_100k", "gdp_per_capita")) %>%
  select(country, continent, gdp_per_capita, suicide_per_100k)

model2 <- lm(suicide_per_100k ~ gdp_per_capita, data = gdp_suicide_no_outliers)

summary(model2)

Call:
lm(formula = suicide_per_100k ~ gdp_per_capita, data = gdp_suicide_no_outliers)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.347  -6.413  -2.431   5.619  25.158 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1.182e+01  1.378e+00   8.579 3.27e-13 ***
gdp_per_capita 8.095e-05  6.208e-05   1.304    0.196    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.102 on 87 degrees of freedom
Multiple R-squared:  0.01917,   Adjusted R-squared:  0.007895 
F-statistic:   1.7 on 1 and 87 DF,  p-value: 0.1957

Based on our analysis, we cannot reject the null hypothesis, which suggests that there is no linear association between the suicide rate per 100,000 population and GDP per capita for each country. However, we anticipate that when we incorporate these variables with other factors, it may reveal a linear association. We will further explore this relationship in the upcoming model chapter.

Check if the new features made any problem in the dataset.

unfactorized_vars <- function(df) {
  var_names <- names(df)
  unfactorized <- var_names[sapply(df, function(x) is.character(x) | is.integer(x))]
  return(unfactorized)
}

# Testing the function
unfactorized_vars(data)
[1] "year"                "avg_temp_bine_jenks" "min_temp_bine_jenks" "max_temp_bine_jenks"
[5] "gdp_bine_median"    

lets factorize them

data$avg_temp_bine_jenks <- factor(data$avg_temp_bine_jenks, 
                   ordered = T, 
                   levels = c("Cold","Warm"))

data$min_temp_bine_jenks <- factor(data$min_temp_bine_jenks, 
                   ordered = T, 
                   levels = c("Very_Cold","Cold","Hot"))

data$max_temp_bine_jenks <- factor(data$max_temp_bine_jenks, 
                   ordered = T, 
                   levels = c("Cold","Warm"))

data$gdp_per_capita_bine_jenks <- factor(data$gdp_per_capita_bine_jenks, 
                   ordered = T, 
                   levels = c("Very_Low_income",
                              "Low_income", 
                              "Medium_income", 
                              "High_income"))
data$gdp_bine_median <- factor(data$gdp_bine_median, 
                   ordered = T, 
                   levels = c("Very_low_income",
                              "low_income", 
                              "Medium_income", 
                              "high_income"))
null_percentage <- function(df) {
  # Calculates the percentage of null values in each column of a dataframe

  # Get the number of nulls in each column
  nulls <- sapply(df, function(x) sum(is.na(x)))

  # Calculate the percentage
  percentages <- nulls / nrow(df) * 100

  # Return the result as a data frame for easier viewing
  return(data.frame(Column = names(df), NullPercentage = percentages))
}

# Usage:
null_percentage(data)
NA

3.4 Demographic Variables

For this specific part, we have data on age, generation, and sex variables. It is important to emphasize that our dataset is well-distributed among each sex and age group. Each sex and age bound is represented by a single row in our dataset, ensuring comprehensive coverage across different demographic categories.

3.3.1 Univariate Anaylysis

Given that our data is well-distributed across different sexes and age groups, we can proceed to visualize a bar plot for the generation variable. This will provide a visual representation of how the data is distributed among different generations.

# Define common theme for all plots
common_theme <- theme_minimal() +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Bar plot for Generation
generation_bar_plot <- data %>%
  ggplot(aes(x = generation, fill = generation)) +
  geom_bar() +
  labs(title = "Bar Plot of Generation",
       x = "Generation",
       y = "Count") +
  scale_fill_brewer(palette = "Set2") +
  common_theme


# Arrange plots
grid.arrange(generation_bar_plot, ncol = 1)

3.3.2 Bivariate Analysis

# Define the function
create_suicide_rate_plot <- function(group_var) {
  data %>%
    group_by(!!sym(group_var)) %>%
    summarize(suicide_per_100k = (sum(suicides_no) / sum(population)) * 100000) %>%
    ggplot(aes_string(x = group_var, y = "suicide_per_100k", fill = group_var)) +
    geom_col() +
    labs(
      #title = group_var, 
      x = group_var,
      y = ""
    ) +
    theme_minimal() +
    theme(
      legend.position = "none",
      plot.title = element_text(hjust = 1),
      axis.text.x = element_text(angle = 0, hjust = 0.5, vjust = 1, size= 4),
      axis.line.x = element_line(inherit.blank = TRUE)
    ) +
    coord_cartesian(ylim = c(0, 30)) + 
    scale_fill_brewer(palette = "Set2")
}

theme_update(plot.title = element_blank(), axis.title.y = element_blank())

sex_plot <- create_suicide_rate_plot("sex")
age_plot <- create_suicide_rate_plot("age")
generation_plot <- create_suicide_rate_plot("generation")

# Arrange the plots
grid.arrange(
  top = textGrob("Global suicides per 100k", gp=gpar(fontsize=16, fontface="bold")),
  left = textGrob("Suicides per 100k", rot=90, gp=gpar(fontsize=16, fontface="bold")),
  arrangeGrob(sex_plot, age_plot, generation_plot, ncol=3)
)

Based on the plots, we observe that suicides are more prevalent among men and the age group of 75 years and older. Regarding the generation variable, it appears that suicides were more common in the G.I. Generation (also known as the World War II generation). However, it is important to note that our data is not evenly distributed among the different generations, with limited data available for the G.I. Generation and Millennials. This necessitates further investigation to draw more reliable conclusions.

To validate the insights from the plots and determine their statistical significance, we will employ statistical tests. These tests will help assess if the observed patterns are statistically significant or merely due to random variation.

3.3.2.1 T_test for Sex

To determine if the assumption of Homogeneity of Variance is satisfied, we can employ the Levene’s test. This statistical test allows us to assess if the variances are equal across the different groups under consideration. By conducting the Levene’s test, we can evaluate if the Homogeneity of Variance assumption holds true in our data.

leveneTest(log_suicide_ratio ~ sex, data = data)
Levene's Test for Homogeneity of Variance (center = median)
         Df F value    Pr(>F)    
group     1  29.423 5.877e-08 ***
      22808                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
t.test(log_suicide_ratio ~ sex, data = data, var.equal = FALSE)

    Welch Two Sample t-test

data:  log_suicide_ratio by sex
t = -91.948, df = 22781, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -1.211644 -1.161065
sample estimates:
mean in group Female   mean in group Male 
           -9.770343            -8.583988 

The test results indicate that there is a statistically significant difference between males and females. This finding suggests that the suicide rates significantly vary between the two genders.

3.3.2.2 ANOVA for Age

Since we have multiple age groups to compare, we can employ the ANOVA (Analysis of Variance) test. The hypothesis for the ANOVA test is as follows:

H0: The mean suicide ratio is equal for all age groups. H1: There is at least one age group with a different mean suicide ratio.

By conducting the ANOVA test, we can determine if there is a statistically significant difference in the mean suicide ratios among the various age groups.

# Fit the model
age_anova <- aov(log_suicide_ratio ~ age, data = data)

# Run the ANOVA
anova_result <- anova(age_anova)

# Print the result
print(anova_result)
Analysis of Variance Table

Response: log_suicide_ratio
             Df Sum Sq Mean Sq F value    Pr(>F)    
age           4   2609  652.26  549.52 < 2.2e-16 ***
Residuals 22805  27069    1.19                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Testing Normality of Residuals Assumption for ANOVA

# Create a data frame for residuals
residuals_df <- data.frame(residuals = residuals(age_anova))

# Create histogram of residuals
hist_plot <- ggplot(residuals_df, aes(x = residuals)) +
  geom_histogram(fill = 'steelblue', color = 'black', bins = 30) +
  theme_minimal() +
  labs(x = "Residuals", y = "Frequency",
       title = "Histogram of Residuals")

# Create Q-Q plot of residuals
qq_plot <- ggplot(residuals_df, aes(sample = residuals)) +
  geom_qq(color = 'steelblue') +
  geom_qq_line(color = 'red') +
  theme_minimal() +
  labs(title = "Normal Q-Q Plot",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles")

# Arrange the plots side by side using the gridExtra package
library(gridExtra)
grid.arrange(hist_plot, qq_plot, ncol = 2)

Testing Homogeneity of Variances Assumption for ANOVA

leveneTest(log_suicide_ratio ~ age, data = data)
Levene's Test for Homogeneity of Variance (center = median)
         Df F value    Pr(>F)    
group     4  55.038 < 2.2e-16 ***
      22805                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
bartlett.test(log_suicide_ratio ~ age, data = data)

    Bartlett test of homogeneity of variances

data:  log_suicide_ratio by age
Bartlett's K-squared = 250.28, df = 4, p-value < 2.2e-16

Since the p-value from both tests is small, we reject the null hypothesis, indicating that the variances are not equal across different groups. In this scenario, using the ANOVA test may not provide accurate results. As an alternative, we can employ the Kruskal-Wallis test, which is a non-parametric test suitable for situations where the assumption of equal variances is violated.

kruskal.test(log_suicide_ratio ~ age, data = data)

    Kruskal-Wallis rank sum test

data:  log_suicide_ratio by age
Kruskal-Wallis chi-squared = 2043.1, df = 4, p-value < 2.2e-16

The result of the Kruskal-Wallis test aligns with that of the ANOVA. The obtained p-value is significantly small, indicating that there is a statistically significant difference in the means of the target variable across the levels of the categorical variable. However, the ANOVA alone does not provide information about which specific groups have different means.

To identify the specific groups with significant mean differences, we can employ Tukey’s Honest Significant Difference (HSD) test. This post-hoc test allows us to conduct pairwise comparisons and determine which groups exhibit statistically significant differences in their means. By performing further investigations using the Tukey’s HSD test, we can gain more insights into the specific group differences.

TukeyHSD(age_anova)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = log_suicide_ratio ~ age, data = data)

$age
                 diff        lwr       upr   p adj
25-34-15-24 0.2162285 0.15399869 0.2784584 0.0e+00
35-54-15-24 0.3288462 0.26661639 0.3910761 0.0e+00
55-74-15-24 0.4999026 0.43767272 0.5621324 0.0e+00
75+-15-24   1.0029526 0.94072278 1.0651825 0.0e+00
35-54-25-34 0.1126177 0.05038784 0.1748475 7.7e-06
55-74-25-34 0.2836740 0.22144417 0.3459039 0.0e+00
75+-25-34   0.7867241 0.72449423 0.8489539 0.0e+00
55-74-35-54 0.1710563 0.10882648 0.2332862 0.0e+00
75+-35-54   0.6741064 0.61187654 0.7363362 0.0e+00
75+-55-74   0.5030501 0.44082021 0.5652799 0.0e+00

3.3.2.3 ANOVA for Generation

# Fit the model
generation_anova <- aov(log_suicide_ratio ~ generation, data = data)

# Run the ANOVA
anova_result <- anova(generation_anova)

# Print the result
print(anova_result)
Analysis of Variance Table

Response: log_suicide_ratio
              Df  Sum Sq Mean Sq F value    Pr(>F)    
generation     4  2251.8  562.95   468.1 < 2.2e-16 ***
Residuals  22805 27425.9    1.20                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
leveneTest(log_suicide_ratio ~ generation, data = data)
Levene's Test for Homogeneity of Variance (center = median)
         Df F value    Pr(>F)    
group     4  35.672 < 2.2e-16 ***
      22805                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
kruskal.test(log_suicide_ratio ~ generation, data = data)

    Kruskal-Wallis rank sum test

data:  log_suicide_ratio by generation
Kruskal-Wallis chi-squared = 1807.2, df = 4, p-value < 2.2e-16

The obtained small p-values indicate that there is a statistically significant difference in the suicide rate among different generations. This finding suggests that the suicide rates vary significantly across the different generational cohorts.

3.3.3 Multivariate Analysis

3.3.3.2 Age differences, by Continent

global_average <- (sum(as.numeric(data$suicides_no)) / sum(as.numeric(data$population))) * 100000

data %>%
  group_by(continent, age) %>%
  summarize(n = n(), 
            suicides = sum(as.numeric(suicides_no)), 
            population = sum(as.numeric(population)), 
            suicide_per_100k = (suicides / population) * 100000, .groups= "drop") %>%
  ggplot(aes(x = continent, y = suicide_per_100k, fill = age)) + 
  geom_bar(stat = "identity", position = "dodge") + 
  geom_hline(yintercept = global_average, linetype = 2, color = "grey35", size = 1) +
  labs(title = "Age Disparity, by Continent",
       x = "Continent", 
       y = "Suicides per 100k", 
       fill = "Age")+
  coord_flip()

In the regions of the Americas, Asia, and Europe, which comprise the majority of the dataset, the suicide rate tends to increase with age. However, it is important to note that for Oceania and Africa, the highest suicide rates are observed among individuals aged 25 to 34. Nevertheless, due to the limited availability of data for Africa, this particular finding may not be entirely reliable. Further investigation and data collection are necessary to provide more accurate insights into the suicide rates in Africa.

3.3.3.3 Gender differences, by Continent

data %>%
  group_by(continent, sex) %>%
  summarize(n = n(), 
            suicides = sum(as.numeric(suicides_no)), 
            population = sum(as.numeric(population)), 
            suicide_per_100k = (suicides / population) * 100000,
            .groups = "drop") %>%
  ggplot(aes(x = continent, y = suicide_per_100k, fill = sex)) + 
  geom_bar(stat = "identity", position = "dodge") + 
  geom_hline(yintercept = global_average, linetype = 2, color = "grey35", size = 1) +
  labs(title = "Gender Disparity, by Continent",
   x = "Continent", 
   y = "Suicides per 100k", 
   fill = "Sex") +
  coord_flip()

Between 1985 and 2015, European men faced the highest risk of suicide, with a rate of approximately 30 suicides per 100,000 population per year. In comparison, Asia had the lowest overrepresentation of male suicide, with the suicide rate for men being around 2.5 times higher than that for women. Conversely, in Europe, the male suicide rate was approximately 3.9 times higher than the female suicide rate, indicating a greater disparity between genders in suicide rates compared to Asia.

3.3.3.4 Gender differences, by Country

# Overall suicide rate by country and continent
country_long <- data %>%
  group_by(country, continent) %>%
  summarize(suicide_per_100k = (sum(suicides_no, na.rm = TRUE) / sum(population, na.rm = TRUE)) * 1e5, .groups = "drop") %>%
  mutate(sex = "OVERALL")

# Suicide rate by country, continent, and sex
sex_country_long <- data %>%
  group_by(country, continent, sex) %>%
  summarize(suicide_per_100k = (sum(suicides_no, na.rm = TRUE) / sum(population, na.rm = TRUE)) * 1e5, .groups = "drop")

# Pivot the data to wide format for visualization, and calculate the difference between Male and Female suicide rates
sex_country_wide <- sex_country_long %>%
  pivot_wider(names_from = sex, values_from = suicide_per_100k) %>%
  arrange(Male - Female)

# Convert 'country' to ordered factor based on difference in suicide rates between Male and Female
ordered_countries <- sex_country_wide$country
sex_country_wide$country <- factor(sex_country_wide$country, ordered = TRUE, levels = ordered_countries)
sex_country_long$country <- factor(sex_country_long$country, ordered = TRUE, levels = ordered_countries)

# Visualization
ggplot(sex_country_wide, aes(y = country, color = sex)) + 
  geom_dumbbell(aes(x=Female, xend=Male), color = "grey", size = 0.5) + 
  geom_point(data = sex_country_long, aes(x = suicide_per_100k), size = 0.5) +
  geom_point(data = country_long, aes(x = suicide_per_100k), size = 0.5) + 
  geom_vline(xintercept = global_average, linetype = 2, color = "grey35", linewidth = 0.5) +
  theme(axis.text.y = element_text(size = 1), legend.position = c(0.85, 0.2)) + 
  scale_x_continuous(breaks = seq(0, round(max(sex_country_wide$Male, na.rm = TRUE) + 10, -1), 10)) +
  labs(title = "Gender Disparity, by Continent & Country", 
       subtitle = "Ordered by difference in deaths per 100k.", 
       x = "Suicides per 100k", 
       y = "Country", 
       color = "Sex")

country_gender_prop <- sex_country_wide %>%
  mutate(Male_Proportion = Male / (Female + Male)) %>%
  arrange(Male_Proportion)

sex_country_long$country <- factor(sex_country_long$country, 
                                   ordered = T,
                                   levels = country_gender_prop$country)

ggplot(sex_country_long, aes(y = suicide_per_100k, x = country, fill = sex)) + 
  geom_bar(position = "fill", stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Proportions of suicides that are Male & Female, by Country", 
       x = "Country", 
       y = "Suicides per 100k",
       fill = "Sex") + 
  coord_flip() +
  theme(
  legend.position = "right",
  legend.key.size = unit(0.25, "cm"),
  plot.title = element_text(hjust = 0.5),
  axis.text.y = element_text(size= 4))

The over representation of men in suicide deaths is a widespread phenomenon observed in various countries. Although women may have higher rates of depression and suicidal thoughts, it is men who are more likely to die by suicide. This paradoxical pattern, known as the gender paradox in suicidal behavior, highlights the complex interplay of factors such as societal expectations, help-seeking behaviors, and coping mechanisms that contribute to the gender disparity in suicide rates. It underscores the need for further research and targeted interventions to address this issue and reduce the burden of suicide among both men and women. links

4.Model

4.1 modification feutures

In this chapter, we begin with a comprehensive overview of our variables, identifying their characteristics and potential areas for refinement. We subsequently make necessary adjustments to improve their suitability for our analysis

data1 <-data.frame(data)

In the initial phase of our analysis, we focus on refining our dataset for more accurate and meaningful results. Specifically, we remove certain columns that are not contributing to our understanding or prediction of the suicide ratio.

As we’ve discussed earlier, variables such as ‘population’, ‘suicide_no’, and their related transformations or scaled versions inherently have a strong association with our target, the ‘suicide_ratio’. The ‘suicide_ratio’ is an estimate of the likelihood of an individual committing suicide in a specific demographic group or country.

While it might seem that ‘population’ would be a beneficial predictor for the ‘suicide_ratio’, including it may skew our results, leading to biased estimations. This is because it’s not the mere size of the population, but specific characteristics within that population that lead to increased suicide ratios.

To navigate this challenge, we incorporate different features to transform and convey the crucial information contained in these variables, without directly using them. This way, we aim to create a model that captures the nuances and complexity of the factors contributing to the suicide ratio.

remove_var = c("sqrt_suicide_no","population","log_population","new_suicides_no","new_suicide_ratio","scaled_population","scaled_log_population","scaled_GDP_for_year","scaled_log_GDP_year","log_suicide_no","scaled_log_GDP_capita","sqrt_population","sqrt_suicide_no","suicides_no" )
data1 <- dplyr::select(data1, -dplyr::one_of(remove_var))

In the section dedicated to outlier analysis, we observed that the log-transformed version of ‘suicide_ratio’ is significantly more resilient to outliers than the original ‘suicide_ratio’ variable. This discovery makes ‘log(suicide_ratio)’ a preferred candidate for our target variable, especially given its normal distribution which is a desirable property for many statistical models.

To prepare our dataset for further modeling, we standardize our predictors by rescaling them to have a mean of zero and a standard deviation of one.Finally, to maintain a tidy dataset, we drop the original untransformed and unscaled columns. This leaves us with a clean, standardized dataset that is ready for the next stages of our analysis and modeling process.

scale_columns <- function(data, columns_to_scale) {
  # Loop over the columns
  for (col in columns_to_scale) {
    # Check if the column exists and is not all NA
    if (!col %in% names(data) || all(is.na(data[[col]]))) {
      message(paste("Column", col, "does not exist or is all NA. Skipping..."))
      next
    }
    # Create a new column name
    new_col_name <- paste0("scaled_", col)
    
    # Scale the column
    data[[new_col_name]] <- scale(data[[col]])
  }
  
  # Drop the original columns
  data <- data[, !(names(data) %in% columns_to_scale)]
  
  return(data)
}
scaled_var = c("GDP_for_year","GDP_per_capita","life_exp","avg_temp","min_temp","max_temp","log_GDP_year","log_GDP_capita","sqrt_GDP_year","sqrt_GDP_capita","log_suicide_ratio","suicide_ratio","sqrt_suicide_ratio")
data2 <- scale_columns(data1,scaled_var)

4.2 Multicollinearity

4.2.1 Continuous Variable

In this section, we employ techniques such as heatmaps and Variance Inflation Factor (VIF) to investigate potential collinearity among our variables.

A heatmap is a valuable visualization tool that illustrates the correlation matrix through a gradient color scheme. By visually representing the correlation coefficients, a heatmap can reveal patterns and relationships among variables, highlighting any potential multicollinearity issues.

On the other hand, VIF is a numerical measure that quantifies the severity of multicollinearity in a regression analysis. It gauges the amount of multicollinearity by examining how much the variance of the estimated regression coefficients is increased due to multicollinearity. A high VIF suggests a high degree of collinearity with other variables, warranting attention.

These techniques collectively give us a holistic view of the correlation structure among our variables, aiding in feature selection and model performance improvement.

# we create a new dataframe which only includes numeric columns using sapply
numeric_data <- data2[sapply(data2, is.numeric)]


width <- 25
height <- 25
options(repr.plot.width = width, repr.plot.height = height)


corr_matrix <- cor(numeric_data)

# Round the correlation matrix to 3 decimal places
rounded_corr <- round(corr_matrix, 3)

# Create the correlation plot
ggcorrplot(rounded_corr, 
           lab = TRUE, 
           lab_size = 1.5, 
           method = "circle", 
           pch = 1, 
           colors = c("red", "#ebebeb", "#13527a")) +
  theme(axis.text.x = element_text(size = 10))

there is strong corrilation between some of the variables

VIF
remove_var <- c("scaled_sqrt_suicide_ratio","scaled_log_suicide_ratio")
data_simple <-dplyr::select(numeric_data, -dplyr::one_of(remove_var))
remove_var <- c("scaled_sqrt_suicide_ratio","scaled_suicide_ratio")
data_log <-dplyr::select(numeric_data, -dplyr::one_of(remove_var))
remove_var <- c("scaled_log_suicide_ratio","scaled_suicide_ratio")
data_sqrt <-dplyr::select(numeric_data, -dplyr::one_of(remove_var))

and for each target we make a vif graph

mod.linear <- lm(scaled_suicide_ratio~ ., data = data_simple)
vifs <- data.frame(vif(mod.linear))

ggplot(vifs, aes(y=vif.mod.linear., x=row.names(vifs))) + 
    geom_bar(aes(fill=vif.mod.linear.>5),stat="identity")+
    scale_y_continuous(trans = "sqrt",  breaks = c(5, 10, 50, 100))+
    geom_hline(yintercept = 5, colour = "red") + 
    ggtitle("VIF per feature for suicide_ratio as target") +
    xlab("Featurs") + ylab("VIF") +
    theme(axis.text.x=element_text(angle=20, hjust=1))+
    theme(text = element_text(size = 10))+
    scale_fill_brewer(palette="RdYlBu")

As observable from our analysis, numerous variables exhibit high VIF, a sign of multicollinearity. This isn’t surprising given that many variables were derived from one another through transformations. To handle this, we need to employ a strategy of variable clustering. Each cluster would contain variables that are highly correlated with one another, indicating potential multicollinearity.

From each cluster, we would then select the variable that strikes the best balance between having the highest correlation with our target and the least susceptibility to outliers. This approach enables us to maintain essential information while mitigating the negative effects of multicollinearity, thereby enhancing our model’s predictive performance.

gdp_var <-c("scaled_GDP_for_year","scaled_GDP_per_capita","scaled_log_GDP_year", "scaled_log_GDP_capita","scaled_sqrt_GDP_year","scaled_sqrt_GDP_capita")
temp_var<-c("scaled_min_temp","scaled_avg_temp","scaled_max_temp")

for suicide_ratio scaled_min_temp and scaled_GPD_per_year lets do this test again

mod.linear <- lm(scaled_suicide_ratio~ ., data = subset((data_simple),select = c(year,scaled_life_exp,scaled_min_temp,scaled_GDP_for_year,scaled_suicide_ratio)))
vifs <- data.frame(vif(mod.linear))
ggplot(vifs, aes(y=vif.mod.linear., x=row.names(vifs))) + 
    geom_bar(aes(fill=vif.mod.linear.>5),stat="identity")+
    scale_y_continuous(trans = "sqrt",  breaks = c(5, 10, 50, 100))+
    geom_hline(yintercept = 5, colour = "red") + 
    ggtitle("VIF per feature for suicide_ratio as target") +
    xlab("Featurs") + ylab("VIF") +
    theme(axis.text.x=element_text(angle=20, hjust=1))+
    theme(text = element_text(size = 10))+
    scale_fill_brewer(palette="RdYlBu")

we do same for log_suicide_ratio and sqrt_suicide_ratio for log_suicide_ratio scaled_avg_temp and scaled_log_GDp_capita suicide_ratio scaled_min_temp and scaled_GPD_per_year lets do this test again

mod.linear <- lm(scaled_log_suicide_ratio~ ., data = subset((data_log),select = c(year,scaled_life_exp,scaled_avg_temp,scaled_log_GDP_capita,scaled_log_suicide_ratio)))
vifs <- data.frame(vif(mod.linear))

ggplot(vifs, aes(y=vif.mod.linear., x=row.names(vifs))) + 
    geom_bar(aes(fill=vif.mod.linear.>5),stat="identity")+
    scale_y_continuous(trans = "sqrt",  breaks = c(5, 10, 50, 100))+
    geom_hline(yintercept = 5, colour = "red") + 
    ggtitle("VIF per feature for suicide_ratio as target") +
    xlab("Featurs") + ylab("VIF") +
    theme(axis.text.x=element_text(angle=20, hjust=1))+
    theme(text = element_text(size = 10))+
    scale_fill_brewer(palette="RdYlBu")

suicide_ratio scaled_min_temp and scaled_GPD_per_year for log_suicide_ratio scaled_avg_temp and scaled_log_GDp_capita for sqrt_suicide_ratio scaled_avg_temp and scaled_log_GDp_capita lets do this test again

mod.linear <- lm(scaled_sqrt_suicide_ratio~ ., data = subset((data_sqrt),select = c(year,scaled_life_exp,scaled_log_GDP_year,scaled_avg_temp,scaled_sqrt_suicide_ratio)))
vifs <- data.frame(vif(mod.linear))

ggplot(vifs, aes(y=vif.mod.linear., x=row.names(vifs))) + 
    geom_bar(aes(fill=vif.mod.linear.>5),stat="identity")+
    scale_y_continuous(trans = "sqrt",  breaks = c(5, 10, 50, 100))+
    geom_hline(yintercept = 5, colour = "red") + 
    ggtitle("VIF per feature for suicide_ratio as target") +
    xlab("Featurs") + ylab("VIF") +
    theme(axis.text.x=element_text(angle=20, hjust=1))+
    theme(text = element_text(size = 10))+
    scale_fill_brewer(palette="RdYlBu")

as we can see all of them have value less than 5 and we can say that there is no coliniarity between these variable. ### 4.2.2 Categorical variable The concept of multicollinearity is a bit less straightforward when applied to categorical variables, particularly because categorical variables can take on limited, and usually few, distinct values.

However, multicollinearity can still occur with categorical variables. For example, suppose you have a dataset of cars, and you have two variables: “Brand” and “Country”. If every “Brand” uniquely maps to a “Country” (e.g., if ‘Toyota’ is always ‘Japan’, ‘Ford’ is always ‘USA’, etc.), then these two variables are perfectly multicollinear. we can use chi squre but chisq is very sensetive to unbalanced variable. we will ues Cramér’s V for categorical variables. Cramér’s V is a statistical measure that assesses the strength of association between two nominal variables. It is based on Pearson’s chi-squared statistic and was published by Harald Cramér in 1946.

Cramér’s V ranges from 0 (indicating no association between the variables) to 1 (indicating a perfect association). It could be seen as an extension of the correlation coefficient to nominal data.

Cramér’s V is symmetrical — it does not matter which variable we consider as independent or dependent. The formula for Cramér’s V is:

V = sqrt((X^2/n) / (min(k-1, r-1)))

where:

X^2 is the chi-squared statistic, n is the total sample size, k is the number of columns, r is the number of rows in the contingency table. Just like with correlation, a value close to 0 indicates little association between the variables, and a value close to 1 indicates a strong association. However, unlike correlation, Cramér’s V can only reach 1 in the case of complete association (all cells other than the diagonal are 0), or when the number of rows equals the number of columns. first seperate categorical var

factor_vars <- sapply(data, is.factor)

factor_vars_names <- names(data)[factor_vars]
factor_vars_names
 [1] "country"                   "sex"                       "age"                      
 [4] "generation"                "continent"                 "population_bine_jenks"    
 [7] "population_bine_median"    "avg_temp_bine_jenks"       "min_temp_bine_jenks"      
[10] "max_temp_bine_jenks"       "gdp_per_capita_bine_jenks" "gdp_bine_median"          

then we apply Cramér’s V for each pair of this variable.

# Retrieve all the categorical variable names
factor_vars_names <- names(data[sapply(data, is.factor)])

# Initialize a data frame to hold the Cramer's V values
V_df <- data.frame(matrix(nrow = length(factor_vars_names), ncol = length(factor_vars_names)))
names(V_df) <- factor_vars_names
rownames(V_df) <- factor_vars_names

# Loop over each pair of variables
for(i in 1:length(factor_vars_names)){
  for(j in 1:length(factor_vars_names)){
    if(i != j){
      
      # Create a contingency table
      tab <- table(data[[factor_vars_names[i]]], data[[factor_vars_names[j]]])
      
      # Perform Chi-square test
      chi_sq <- chisq.test(tab)
      
      # Calculate Cramer's V
      n <- sum(tab) # total number of observations
      k <- min(dim(tab)) # number of rows or columns (whichever is smaller)
      V <- sqrt(chi_sq$statistic / (n * (k - 1)))
      
      V_df[i,j] <- V
      
      cat("Cramer's V for", factor_vars_names[i], "and", factor_vars_names[j], ":", V, "\n")
      
    } else {
      V_df[i,j] <- NA
    }
  }
}
Cramer's V for country and sex : 0 
Warning: Chi-squared approximation may be incorrect
Cramer's V for country and age : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for country and generation : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for country and continent : 1 
Warning: Chi-squared approximation may be incorrect
Cramer's V for country and population_bine_jenks : 0.9555141 
Cramer's V for country and population_bine_median : 0.9658267 
Cramer's V for country and avg_temp_bine_jenks : 0.919768 
Cramer's V for country and min_temp_bine_jenks : 0.8440007 
Cramer's V for country and max_temp_bine_jenks : 0.9293818 
Warning: Chi-squared approximation may be incorrect
Cramer's V for country and gdp_per_capita_bine_jenks : 0.8451701 
Cramer's V for country and gdp_bine_median : 0.6755723 
Cramer's V for sex and country : 0 
Warning: Chi-squared approximation may be incorrect
Cramer's V for sex and age : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for sex and generation : NaN 
Cramer's V for sex and continent : 0 
Cramer's V for sex and population_bine_jenks : 0 
Cramer's V for sex and population_bine_median : 0 
Cramer's V for sex and avg_temp_bine_jenks : 0 
Cramer's V for sex and min_temp_bine_jenks : 0 
Cramer's V for sex and max_temp_bine_jenks : 0 
Cramer's V for sex and gdp_per_capita_bine_jenks : 0 
Cramer's V for sex and gdp_bine_median : 0 
Warning: Chi-squared approximation may be incorrect
Cramer's V for age and country : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for age and sex : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for age and generation : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for age and continent : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for age and population_bine_jenks : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for age and population_bine_median : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for age and avg_temp_bine_jenks : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for age and min_temp_bine_jenks : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for age and max_temp_bine_jenks : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for age and gdp_per_capita_bine_jenks : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for age and gdp_bine_median : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for generation and country : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for generation and sex : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for generation and age : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for generation and continent : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for generation and population_bine_jenks : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for generation and population_bine_median : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for generation and avg_temp_bine_jenks : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for generation and min_temp_bine_jenks : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for generation and max_temp_bine_jenks : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for generation and gdp_per_capita_bine_jenks : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for generation and gdp_bine_median : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for continent and country : 1 
Cramer's V for continent and sex : 0 
Warning: Chi-squared approximation may be incorrect
Cramer's V for continent and age : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for continent and generation : NaN 
Cramer's V for continent and population_bine_jenks : 0.1360322 
Cramer's V for continent and population_bine_median : 0.2201354 
Cramer's V for continent and avg_temp_bine_jenks : 0.6364753 
Cramer's V for continent and min_temp_bine_jenks : 0.501039 
Cramer's V for continent and max_temp_bine_jenks : 0.626857 
Cramer's V for continent and gdp_per_capita_bine_jenks : 0.2519669 
Cramer's V for continent and gdp_bine_median : 0.2615279 
Warning: Chi-squared approximation may be incorrect
Cramer's V for population_bine_jenks and country : 0.9555141 
Cramer's V for population_bine_jenks and sex : 0 
Warning: Chi-squared approximation may be incorrect
Cramer's V for population_bine_jenks and age : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for population_bine_jenks and generation : NaN 
Cramer's V for population_bine_jenks and continent : 0.1360322 
Cramer's V for population_bine_jenks and population_bine_median : 0.5621135 
Cramer's V for population_bine_jenks and avg_temp_bine_jenks : 0.08665067 
Cramer's V for population_bine_jenks and min_temp_bine_jenks : 0.1588227 
Cramer's V for population_bine_jenks and max_temp_bine_jenks : 0.1572724 
Cramer's V for population_bine_jenks and gdp_per_capita_bine_jenks : 0.1833429 
Cramer's V for population_bine_jenks and gdp_bine_median : 0.09180943 
Cramer's V for population_bine_median and country : 0.9658267 
Cramer's V for population_bine_median and sex : 0 
Warning: Chi-squared approximation may be incorrect
Cramer's V for population_bine_median and age : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for population_bine_median and generation : NaN 
Cramer's V for population_bine_median and continent : 0.2201354 
Cramer's V for population_bine_median and population_bine_jenks : 0.5621135 
Cramer's V for population_bine_median and avg_temp_bine_jenks : 0.3237402 
Cramer's V for population_bine_median and min_temp_bine_jenks : 0.2851608 
Cramer's V for population_bine_median and max_temp_bine_jenks : 0.3904609 
Cramer's V for population_bine_median and gdp_per_capita_bine_jenks : 0.1419864 
Cramer's V for population_bine_median and gdp_bine_median : 0.1003279 
Cramer's V for avg_temp_bine_jenks and country : 0.919768 
Cramer's V for avg_temp_bine_jenks and sex : 0 
Warning: Chi-squared approximation may be incorrect
Cramer's V for avg_temp_bine_jenks and age : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for avg_temp_bine_jenks and generation : NaN 
Cramer's V for avg_temp_bine_jenks and continent : 0.6364753 
Cramer's V for avg_temp_bine_jenks and population_bine_jenks : 0.08665067 
Cramer's V for avg_temp_bine_jenks and population_bine_median : 0.3237402 
Cramer's V for avg_temp_bine_jenks and min_temp_bine_jenks : 0.844372 
Cramer's V for avg_temp_bine_jenks and max_temp_bine_jenks : 0.7520485 
Cramer's V for avg_temp_bine_jenks and gdp_per_capita_bine_jenks : 0.3638312 
Cramer's V for avg_temp_bine_jenks and gdp_bine_median : 0.260163 
Cramer's V for min_temp_bine_jenks and country : 0.8440007 
Cramer's V for min_temp_bine_jenks and sex : 0 
Warning: Chi-squared approximation may be incorrect
Cramer's V for min_temp_bine_jenks and age : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for min_temp_bine_jenks and generation : NaN 
Cramer's V for min_temp_bine_jenks and continent : 0.501039 
Cramer's V for min_temp_bine_jenks and population_bine_jenks : 0.1588227 
Cramer's V for min_temp_bine_jenks and population_bine_median : 0.2851608 
Cramer's V for min_temp_bine_jenks and avg_temp_bine_jenks : 0.844372 
Cramer's V for min_temp_bine_jenks and max_temp_bine_jenks : 0.7471725 
Cramer's V for min_temp_bine_jenks and gdp_per_capita_bine_jenks : 0.2685126 
Cramer's V for min_temp_bine_jenks and gdp_bine_median : 0.2143205 
Cramer's V for max_temp_bine_jenks and country : 0.9293818 
Cramer's V for max_temp_bine_jenks and sex : 0 
Warning: Chi-squared approximation may be incorrect
Cramer's V for max_temp_bine_jenks and age : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for max_temp_bine_jenks and generation : NaN 
Cramer's V for max_temp_bine_jenks and continent : 0.626857 
Cramer's V for max_temp_bine_jenks and population_bine_jenks : 0.1572724 
Cramer's V for max_temp_bine_jenks and population_bine_median : 0.3904609 
Cramer's V for max_temp_bine_jenks and avg_temp_bine_jenks : 0.7520485 
Cramer's V for max_temp_bine_jenks and min_temp_bine_jenks : 0.7471725 
Cramer's V for max_temp_bine_jenks and gdp_per_capita_bine_jenks : 0.3330357 
Cramer's V for max_temp_bine_jenks and gdp_bine_median : 0.249758 
Warning: Chi-squared approximation may be incorrect
Cramer's V for gdp_per_capita_bine_jenks and country : 0.8451701 
Cramer's V for gdp_per_capita_bine_jenks and sex : 0 
Warning: Chi-squared approximation may be incorrect
Cramer's V for gdp_per_capita_bine_jenks and age : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for gdp_per_capita_bine_jenks and generation : NaN 
Cramer's V for gdp_per_capita_bine_jenks and continent : 0.2519669 
Cramer's V for gdp_per_capita_bine_jenks and population_bine_jenks : 0.1833429 
Cramer's V for gdp_per_capita_bine_jenks and population_bine_median : 0.1419864 
Cramer's V for gdp_per_capita_bine_jenks and avg_temp_bine_jenks : 0.3638312 
Cramer's V for gdp_per_capita_bine_jenks and min_temp_bine_jenks : 0.2685126 
Cramer's V for gdp_per_capita_bine_jenks and max_temp_bine_jenks : 0.3330357 
Cramer's V for gdp_per_capita_bine_jenks and gdp_bine_median : 0.5334269 
Cramer's V for gdp_bine_median and country : 0.6755723 
Cramer's V for gdp_bine_median and sex : 0 
Warning: Chi-squared approximation may be incorrect
Cramer's V for gdp_bine_median and age : NaN 
Warning: Chi-squared approximation may be incorrect
Cramer's V for gdp_bine_median and generation : NaN 
Cramer's V for gdp_bine_median and continent : 0.2615279 
Cramer's V for gdp_bine_median and population_bine_jenks : 0.09180943 
Cramer's V for gdp_bine_median and population_bine_median : 0.1003279 
Cramer's V for gdp_bine_median and avg_temp_bine_jenks : 0.260163 
Cramer's V for gdp_bine_median and min_temp_bine_jenks : 0.2143205 
Cramer's V for gdp_bine_median and max_temp_bine_jenks : 0.249758 
Cramer's V for gdp_bine_median and gdp_per_capita_bine_jenks : 0.5334269 
# Replace NA values with 0
V_df[is.na(V_df)] <- 0


print(V_df)
NA
#install.packages("pheatmap")
library(pheatmap)

# Make the heatmap
pheatmap(V_df, color = colorRampPalette(c("navy", "white", "firebrick3"))(25))

As observed, the ‘country’ variable demonstrates significant associations with numerous variables. This is expected given that these variables were created via a ‘group_by’ operation on ‘country’.

However, the crucial observation is the substantial association among ‘avg_temp_bine_jenks’, ‘min_temp_bine_jenks’, and ‘max_temp_bine_jenks’. For model efficiency, we should select one from this set.

To guide this selection, we ran several linear models to evaluate compatibility between these temperature variables and our potential targets (‘suicide_ratio’, ‘log_suicide_ratio’, and ‘sqrt_suicide_ratio’).

We compiled a dataframe featuring our three targets and the three temperature variables. The dataframe entries represent the adjusted R-squared values for each corresponding pair, providing a basis for optimal feature selection.

targets <- c("scaled_suicide_ratio", "scaled_log_suicide_ratio", "scaled_sqrt_suicide_ratio")
variables <- c("avg_temp_bine_jenks","min_temp_bine_jenks", "max_temp_bine_jenks"
)

adjusted_r2 <- matrix(nrow = length(targets), ncol = length(variables))
rownames(adjusted_r2) <- targets
colnames(adjusted_r2) <- variables

# loop over each target and variable
for (target in targets) {
  for (var in variables) {
    
    formula <- as.formula(paste(target, var, sep = " ~ "))
    
    # fit the linear model
    model <- lm(formula, data = data2)
    
    adjusted_r2[target, var] <- summary(model)$adj.r.squared
  }
}

# convert the matrix to a data frame
adjusted_r2_df <- as.data.frame(adjusted_r2)


print(adjusted_r2_df)

Considering the three potential target variables, ‘min_temp_bine_jenks’ consistently shows better performance in terms of R-squared values.

Thus far, we have categorized our features into continuous and categorical candidates.

Our final feature candidates, as determined by their collinearity and correlation with the target variable, are as follows:

scaled_suicide_ratio_var <-c("year","country","sex","age","continent","population_bine_jenks","scaled_GDP_for_year","min_temp_bine_jenks","gdp_per_capita_bine_jenks","scaled_min_temp")
scaled_log_suicide_ratio_var <-c("year","country","sex","age","continent","population_bine_jenks","scaled_log_GDP_capita","min_temp_bine_jenks","gdp_per_capita_bine_jenks","scaled_avg_temp")
scaled_sqrt_suicide_ratio_var <-c("year","country","sex","age","continent","population_bine_jenks","scaled_log_GDP_capita" ,"min_temp_bine_jenks","gdp_per_capita_bine_jenks","scaled_avg_temp")

4.3 Models selection

so for we have 3 target variable and for each one we found different proper variable. in this section we inspect different models with different criteria

To apply linear regression we need to make sure that four conditions are satisfied:

1.No multicollinearity: no high correlation between the independent variables; 2.Linearity: there must be a linear relationship between the target variablesand the other variables; 3.Normality: the residuals must be normally distributed; 4.Homoscedasticity: the residuals must have a constant variance

in previoues section we inspect multiliniarity problem and gave proper solution for each targts lets first make a simple model for each variable and see which conditions will meet.

#suicide_ratio
formula <- as.formula(paste("scaled_suicide_ratio", "~", paste(scaled_suicide_ratio_var, collapse = " + ")))

model_suicide_ratio <- lm(formula, data = data2)
summary(model_suicide_ratio)

Call:
lm(formula = formula, data = data2)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3028 -0.3612 -0.0627  0.2668  9.7397 

Coefficients: (4 not defined because of singularities)
                                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         14.8096072  1.1780043  12.572  < 2e-16 ***
year                                -0.0078654  0.0005886 -13.364  < 2e-16 ***
countryAntigua and Barbuda          -0.1950023  0.1017517  -1.916 0.055320 .  
countryArgentina                     0.4781366  0.0878931   5.440 5.38e-08 ***
countryArmenia                       0.0076525  0.0668824   0.114 0.908908    
countryAruba                         0.3885659  0.1158588   3.354 0.000798 ***
countryAustralia                     0.6009979  0.0854668   7.032 2.09e-12 ***
countryAustria                       1.2725577  0.0763190  16.674  < 2e-16 ***
countryAzerbaijan                   -0.1115302  0.0702925  -1.587 0.112604    
countryBahamas                      -0.1172060  0.0987623  -1.187 0.235339    
countryBahrain                      -0.0838931  0.0875715  -0.958 0.338076    
countryBarbados                     -0.0541948  0.1045530  -0.518 0.604221    
countryBelarus                       1.6607261  0.0720502  23.050  < 2e-16 ***
countryBelgium                       1.1143428  0.0726324  15.342  < 2e-16 ***
countryBelize                        0.1584722  0.0937904   1.690 0.091110 .  
countryBrazil                       -0.3606057  0.1659609  -2.173 0.029803 *  
countryBulgaria                      0.9547131  0.0610557  15.637  < 2e-16 ***
countryCanada                        0.6390917  0.1274398   5.015 5.35e-07 ***
countryChile                         0.4254113  0.0595136   7.148 9.06e-13 ***
countryColombia                      0.1572541  0.1158979   1.357 0.174848    
countryCosta Rica                    0.1950856  0.0997969   1.955 0.050616 .  
countryCroatia                       1.2095605  0.0651363  18.570  < 2e-16 ***
countryCuba                          1.0789805  0.0958202  11.260  < 2e-16 ***
countryCyprus                        0.0414392  0.0876464   0.473 0.636361    
countryCzech Republic                0.9262792  0.0655212  14.137  < 2e-16 ***
countryDenmark                       0.7333559  0.1338261   5.480 4.30e-08 ***
countryEcuador                       0.1478544  0.0900761   1.641 0.100720    
countryEl Salvador                   0.4216083  0.0988820   4.264 2.02e-05 ***
countryEstonia                       1.4605530  0.0737846  19.795  < 2e-16 ***
countryFiji                          0.1297922  0.1080724   1.201 0.229774    
countryFinland                       1.2114869  0.0876298  13.825  < 2e-16 ***
countryFrance                        1.1373617  0.1085814  10.475  < 2e-16 ***
countryGeorgia                       0.0704287  0.0658792   1.069 0.285055    
countryGermany                       0.7967927  0.1106660   7.200 6.21e-13 ***
countryGreece                        0.0369190  0.0669733   0.551 0.581468    
countryGrenada                      -0.0819338  0.1039626  -0.788 0.430642    
countryGuatemala                    -0.0266986  0.0894393  -0.299 0.765316    
countryGuyana                        1.0963582  0.1028369  10.661  < 2e-16 ***
countryHungary                       1.8039871  0.0642018  28.099  < 2e-16 ***
countryIceland                       0.5972588  0.0766395   7.793 6.82e-15 ***
countryIreland                       0.4620733  0.0691062   6.686 2.34e-11 ***
countryIsrael                        0.3335001  0.0766149   4.353 1.35e-05 ***
countryItaly                         0.3379881  0.1045898   3.232 0.001233 ** 
countryJamaica                      -0.2071754  0.1040103  -1.992 0.046397 *  
countryJapan                         0.6649714  0.1535815   4.330 1.50e-05 ***
countryKazakhstan                    1.6279453  0.0811298  20.066  < 2e-16 ***
countryKiribati                      0.1185982  0.1149850   1.031 0.302353    
countryKuwait                       -0.1142358  0.0816881  -1.398 0.161995    
countryKyrgyzstan                    0.6596865  0.0774432   8.518  < 2e-16 ***
countryLatvia                        1.5848700  0.0729427  21.728  < 2e-16 ***
countryLithuania                     2.2618971  0.0719226  31.449  < 2e-16 ***
countryLuxembourg                    0.8192498  0.0780965  10.490  < 2e-16 ***
countryMalta                         0.0861410  0.0763500   1.128 0.259231    
countryMauritius                     0.4770723  0.0895077   5.330 9.92e-08 ***
countryMexico                        0.0641878  0.1103882   0.581 0.560927    
countryMontenegro                    0.4242783  0.0821029   5.168 2.39e-07 ***
countryNetherlands                   0.4754732  0.0725494   6.554 5.73e-11 ***
countryNew Zealand                   0.6503765  0.0679812   9.567  < 2e-16 ***
countryNicaragua                     0.2601074  0.1260364   2.064 0.039053 *  
countryNorway                        0.6018882  0.0861976   6.983 2.98e-12 ***
countryPanama                        0.1366746  0.1031305   1.325 0.185098    
countryParaguay                      0.0334652  0.0832579   0.402 0.687727    
countryPhilippines                  -0.0580201  0.1333516  -0.435 0.663500    
countryPoland                        0.7925555  0.0908728   8.722  < 2e-16 ***
countryPortugal                      0.4613540  0.0700223   6.589 4.53e-11 ***
countryPuerto Rico                   0.3867521  0.1009459   3.831 0.000128 ***
countryQatar                        -0.0152730  0.1007138  -0.152 0.879466    
countryRepublic of Korea             1.3593647  0.0909233  14.951  < 2e-16 ***
countryRomania                       0.5954025  0.0752994   7.907 2.75e-15 ***
countryRussian Federation            1.4538614  0.1749608   8.310  < 2e-16 ***
countrySaint Lucia                   0.2035607  0.1022536   1.991 0.046521 *  
countrySaint Vincent and Grenadines  0.1338394  0.1037865   1.290 0.197215    
countrySerbia                        1.1779278  0.0694849  16.952  < 2e-16 ***
countrySeychelles                    0.2565989  0.1077194   2.382 0.017222 *  
countrySingapore                     0.8289333  0.1086373   7.630 2.43e-14 ***
countrySlovakia                      0.5691105  0.0675672   8.423  < 2e-16 ***
countrySlovenia                      1.4988496  0.0727224  20.611  < 2e-16 ***
countrySouth Africa                 -0.0492529  0.0975861  -0.505 0.613766    
countrySpain                         0.4190532  0.0917002   4.570 4.91e-06 ***
countrySri Lanka                     1.8635695  0.1132795  16.451  < 2e-16 ***
countrySuriname                      1.0361597  0.1022244  10.136  < 2e-16 ***
countrySweden                        0.7217350  0.0824977   8.749  < 2e-16 ***
countrySwitzerland                   1.0080982  0.0833744  12.091  < 2e-16 ***
countryThailand                      0.2167633  0.1212717   1.787 0.073883 .  
countryTrinidad and Tobago           0.5834628  0.1017326   5.735 9.86e-09 ***
countryTurkey                        0.0504615  0.1176621   0.429 0.668023    
countryTurkmenistan                  0.3174057  0.0602745   5.266 1.41e-07 ***
countryUkraine                       1.4231449  0.0940799  15.127  < 2e-16 ***
countryUnited Arab Emirates         -0.0519189  0.1190825  -0.436 0.662847    
countryUnited Kingdom                0.2943397  0.1068800   2.754 0.005893 ** 
countryUnited States                 0.1547038  0.1814318   0.853 0.393844    
countryUruguay                       0.9545288  0.0710045  13.443  < 2e-16 ***
countryUzbekistan                    0.2992853  0.0684480   4.372 1.23e-05 ***
sexMale                              0.8977982  0.0088677 101.243  < 2e-16 ***
age.L                                0.5429783  0.0099144  54.767  < 2e-16 ***
age.Q                                0.1022368  0.0099144  10.312  < 2e-16 ***
age.C                                0.1129714  0.0099144  11.395  < 2e-16 ***
age^4                                0.0557506  0.0099144   5.623 1.90e-08 ***
continentAmericas                           NA         NA      NA       NA    
continentAsia                               NA         NA      NA       NA    
continentEurope                             NA         NA      NA       NA    
continentOceania                            NA         NA      NA       NA    
population_bine_jenks.L              0.3398234  0.0975265   3.484 0.000494 ***
population_bine_jenks.Q              0.2752480  0.0636123   4.327 1.52e-05 ***
population_bine_jenks.C              0.0686367  0.0362801   1.892 0.058523 .  
scaled_GDP_for_year                  0.0023482  0.0126586   0.185 0.852839    
min_temp_bine_jenks.L               -0.0072806  0.0256081  -0.284 0.776177    
min_temp_bine_jenks.Q                0.0080247  0.0143677   0.559 0.576494    
gdp_per_capita_bine_jenks.L         -0.0311668  0.0373266  -0.835 0.403741    
gdp_per_capita_bine_jenks.Q          0.0046895  0.0215878   0.217 0.828030    
gdp_per_capita_bine_jenks.C          0.0246067  0.0163459   1.505 0.132242    
scaled_min_temp                      0.0110464  0.0480953   0.230 0.818345    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6696 on 22702 degrees of freedom
Multiple R-squared:  0.5537,    Adjusted R-squared:  0.5516 
F-statistic: 263.2 on 107 and 22702 DF,  p-value: < 2.2e-16
data2 <- data2%>%
  filter(age != '5-14')
#log_suicide_ratio
formula <- as.formula(paste("scaled_log_suicide_ratio", "~", paste(scaled_log_suicide_ratio_var, collapse = " + ")))

model_log_suicide_ratio <- lm(formula, data = data2)
summary(model_log_suicide_ratio)

Call:
lm(formula = formula, data = data2)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9213 -0.2769  0.0356  0.2978  2.6194 

Coefficients: (4 not defined because of singularities)
                                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)                          6.7180196  1.6954611   3.962 7.44e-05 ***
year                                -0.0042127  0.0008389  -5.022 5.16e-07 ***
countryAntigua and Barbuda           2.3031385  0.1042100  22.101  < 2e-16 ***
countryArgentina                     1.0919542  0.0683369  15.979  < 2e-16 ***
countryArmenia                      -0.2704081  0.0514003  -5.261 1.45e-07 ***
countryAruba                         2.4977499  0.1176818  21.225  < 2e-16 ***
countryAustralia                     1.5900352  0.0841628  18.892  < 2e-16 ***
countryAustria                       1.3962189  0.0737829  18.923  < 2e-16 ***
countryAzerbaijan                   -0.6475496  0.0517029 -12.524  < 2e-16 ***
countryBahamas                       1.2983243  0.1005920  12.907  < 2e-16 ***
countryBahrain                       1.1454880  0.1071545  10.690  < 2e-16 ***
countryBarbados                      1.3589670  0.1053554  12.899  < 2e-16 ***
countryBelarus                       1.5162001  0.0621441  24.398  < 2e-16 ***
countryBelgium                       1.5168695  0.0644161  23.548  < 2e-16 ***
countryBelize                        1.8965059  0.0941706  20.139  < 2e-16 ***
countryBrazil                        0.8763168  0.1407621   6.226 4.88e-10 ***
countryBulgaria                      1.2235238  0.0465456  26.287  < 2e-16 ***
countryCanada                        0.6104271  0.1359913   4.489 7.20e-06 ***
countryChile                         0.6945639  0.0521095  13.329  < 2e-16 ***
countryColombia                      0.8520593  0.1033498   8.244  < 2e-16 ***
countryCosta Rica                    1.1210072  0.0975155  11.496  < 2e-16 ***
countryCroatia                       1.5605500  0.0518298  30.109  < 2e-16 ***
countryCuba                          1.9695686  0.0964438  20.422  < 2e-16 ***
countryCyprus                        0.6170689  0.0773764   7.975 1.60e-15 ***
countryCzech Republic                1.1850130  0.0589826  20.091  < 2e-16 ***
countryDenmark                       0.0497837  0.2054182   0.242 0.808509    
countryEcuador                       0.8988255  0.0759110  11.841  < 2e-16 ***
countryEl Salvador                   1.4183500  0.0938476  15.113  < 2e-16 ***
countryEstonia                       1.4605259  0.0699135  20.890  < 2e-16 ***
countryFiji                          1.2484271  0.0993617  12.564  < 2e-16 ***
countryFinland                       1.2762077  0.0942812  13.536  < 2e-16 ***
countryFrance                        1.6880783  0.0840841  20.076  < 2e-16 ***
countryGeorgia                      -0.1408546  0.0531395  -2.651 0.008039 ** 
countryGermany                       1.2442436  0.0880506  14.131  < 2e-16 ***
countryGreece                        0.1883795  0.0560814   3.359 0.000783 ***
countryGrenada                       2.1928018  0.1047615  20.931  < 2e-16 ***
countryGuatemala                     0.3155934  0.0830149   3.802 0.000144 ***
countryGuyana                        2.2232707  0.0998517  22.266  < 2e-16 ***
countryHungary                       1.7311197  0.0523611  33.061  < 2e-16 ***
countryIceland                       1.1061485  0.0960449  11.517  < 2e-16 ***
countryIreland                       0.9067602  0.0634780  14.285  < 2e-16 ***
countryIsrael                        1.1020853  0.0745122  14.791  < 2e-16 ***
countryItaly                         0.8383662  0.0813901  10.301  < 2e-16 ***
countryJamaica                      -0.5097171  0.1027016  -4.963 6.99e-07 ***
countryJapan                         1.6595416  0.1117632  14.849  < 2e-16 ***
countryKazakhstan                    1.5948075  0.0629122  25.350  < 2e-16 ***
countryKiribati                      2.6790982  0.1098995  24.378  < 2e-16 ***
countryKuwait                        0.6872664  0.1027728   6.687 2.33e-11 ***
countryKyrgyzstan                    0.7558994  0.0729091  10.368  < 2e-16 ***
countryLatvia                        1.5259161  0.0669703  22.785  < 2e-16 ***
countryLithuania                     1.8628210  0.0648461  28.727  < 2e-16 ***
countryLuxembourg                    1.4596858  0.0708076  20.615  < 2e-16 ***
countryMalta                         1.0610733  0.0676287  15.690  < 2e-16 ***
countryMauritius                     1.6682225  0.0857462  19.455  < 2e-16 ***
countryMexico                        0.5250957  0.0941143   5.579 2.44e-08 ***
countryMontenegro                    0.4509376  0.0617213   7.306 2.84e-13 ***
countryNetherlands                   0.9999114  0.0649489  15.395  < 2e-16 ***
countryNew Zealand                   1.2168806  0.0595049  20.450  < 2e-16 ***
countryNicaragua                     1.0953209  0.1128264   9.708  < 2e-16 ***
countryNorway                        0.8551319  0.1035942   8.255  < 2e-16 ***
countryPanama                        1.0266985  0.1019069  10.075  < 2e-16 ***
countryParaguay                      0.6673450  0.0839018   7.954 1.89e-15 ***
countryPhilippines                   0.3131289  0.1220407   2.566 0.010301 *  
countryPoland                        1.1537004  0.0742494  15.538  < 2e-16 ***
countryPortugal                      0.9194589  0.0562908  16.334  < 2e-16 ***
countryPuerto Rico                   1.2015571  0.1005941  11.945  < 2e-16 ***
countryQatar                         1.2947877  0.1177229  10.999  < 2e-16 ***
countryRepublic of Korea             1.7351080  0.0712638  24.348  < 2e-16 ***
countryRomania                       0.9682255  0.0586017  16.522  < 2e-16 ***
countryRussian Federation            1.1776785  0.1499683   7.853 4.25e-15 ***
countrySaint Lucia                   2.0867597  0.1041426  20.038  < 2e-16 ***
countrySaint Vincent and Grenadines  2.2824205  0.1049011  21.758  < 2e-16 ***
countrySerbia                        1.3787293  0.0522034  26.411  < 2e-16 ***
countrySeychelles                    2.5356266  0.1066803  23.768  < 2e-16 ***
countrySingapore                     1.9514624  0.1083592  18.009  < 2e-16 ***
countrySlovakia                      0.5910512  0.0599467   9.860  < 2e-16 ***
countrySlovenia                      1.6481195  0.0600362  27.452  < 2e-16 ***
countrySouth Africa                 -0.7609252  0.0762778  -9.976  < 2e-16 ***
countrySpain                         0.9693693  0.0740129  13.097  < 2e-16 ***
countrySri Lanka                     2.4984897  0.1114334  22.421  < 2e-16 ***
countrySuriname                      2.3072065  0.1008992  22.866  < 2e-16 ***
countrySweden                        1.0246462  0.0901457  11.367  < 2e-16 ***
countrySwitzerland                   1.3875146  0.0768983  18.043  < 2e-16 ***
countryThailand                      1.3273935  0.1162242  11.421  < 2e-16 ***
countryTrinidad and Tobago           1.8540224  0.1020284  18.172  < 2e-16 ***
countryTurkey                       -0.0427117  0.0879368  -0.486 0.627178    
countryTurkmenistan                  0.8744881  0.0478318  18.283  < 2e-16 ***
countryUkraine                       1.5377334  0.0711577  21.610  < 2e-16 ***
countryUnited Arab Emirates          0.6835672  0.1299405   5.261 1.45e-07 ***
countryUnited Kingdom                0.7018319  0.0872288   8.046 8.98e-16 ***
countryUnited States                 1.0228650  0.1171657   8.730  < 2e-16 ***
countryUruguay                       1.5931893  0.0587398  27.123  < 2e-16 ***
countryUzbekistan                    0.7599959  0.0494784  15.360  < 2e-16 ***
sexMale                              1.0400454  0.0064787 160.534  < 2e-16 ***
age.L                                0.6347366  0.0072434  87.630  < 2e-16 ***
age.Q                                0.1480973  0.0072434  20.446  < 2e-16 ***
age.C                                0.1207620  0.0072434  16.672  < 2e-16 ***
age^4                                0.0116838  0.0072434   1.613 0.106750    
continentAmericas                           NA         NA      NA       NA    
continentAsia                               NA         NA      NA       NA    
continentEurope                             NA         NA      NA       NA    
continentOceania                            NA         NA      NA       NA    
population_bine_jenks.L              0.0392961  0.0713593   0.551 0.581858    
population_bine_jenks.Q              0.1488624  0.0465793   3.196 0.001396 ** 
population_bine_jenks.C             -0.0054825  0.0266270  -0.206 0.836872    
scaled_log_GDP_capita               -0.0811281  0.0160452  -5.056 4.31e-07 ***
min_temp_bine_jenks.L                0.0255438  0.0181665   1.406 0.159711    
min_temp_bine_jenks.Q                0.0006281  0.0104879   0.060 0.952246    
gdp_per_capita_bine_jenks.L          0.0413718  0.0283975   1.457 0.145163    
gdp_per_capita_bine_jenks.Q         -0.0455872  0.0156630  -2.910 0.003612 ** 
gdp_per_capita_bine_jenks.C          0.0079187  0.0119083   0.665 0.506077    
scaled_avg_temp                     -0.3907802  0.0604192  -6.468 1.01e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4892 on 22702 degrees of freedom
Multiple R-squared:  0.7618,    Adjusted R-squared:  0.7606 
F-statistic: 678.4 on 107 and 22702 DF,  p-value: < 2.2e-16
#ssqrt_uicide_ratio
formula <- as.formula(paste("scaled_sqrt_suicide_ratio", "~", paste(scaled_sqrt_suicide_ratio_var, collapse = " + ")))

model_sqrt_suicide_ratio <- lm(formula, data = data2)
summary(model_sqrt_suicide_ratio)

Call:
lm(formula = formula, data = data2)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8872 -0.3105 -0.0238  0.2646  5.0725 

Coefficients: (4 not defined because of singularities)
                                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)                          1.7377932  1.9504611   0.891 0.372957    
year                                -0.0015243  0.0009651  -1.579 0.114251    
countryAntigua and Barbuda          -0.0470087  0.1198834  -0.392 0.694973    
countryArgentina                     0.9197022  0.0786149  11.699  < 2e-16 ***
countryArmenia                      -0.1586816  0.0591310  -2.684 0.007290 ** 
countryAruba                         0.8199850  0.1353814   6.057 1.41e-09 ***
countryAustralia                     1.4014686  0.0968210  14.475  < 2e-16 ***
countryAustria                       1.3067676  0.0848799  15.395  < 2e-16 ***
countryAzerbaijan                   -0.2800007  0.0594791  -4.708 2.52e-06 ***
countryBahamas                       0.1353162  0.1157212   1.169 0.242283    
countryBahrain                       0.3317641  0.1232707   2.691 0.007122 ** 
countryBarbados                      0.3269507  0.1212010   2.698 0.006989 ** 
countryBelarus                       1.4867219  0.0714907  20.796  < 2e-16 ***
countryBelgium                       1.3624497  0.0741044  18.386  < 2e-16 ***
countryBelize                        0.5619979  0.1083340   5.188 2.15e-07 ***
countryBrazil                        0.6147527  0.1619330   3.796 0.000147 ***
countryBulgaria                      1.0956692  0.0535461  20.462  < 2e-16 ***
countryCanada                        0.3861691  0.1564447   2.468 0.013579 *  
countryChile                         0.5849833  0.0599468   9.758  < 2e-16 ***
countryColombia                      0.8933616  0.1188937   7.514 5.95e-14 ***
countryCosta Rica                    0.9640306  0.1121820   8.593  < 2e-16 ***
countryCroatia                       1.4197400  0.0596251  23.811  < 2e-16 ***
countryCuba                          1.8406770  0.1109491  16.590  < 2e-16 ***
countryCyprus                        0.2902955  0.0890140   3.261 0.001111 ** 
countryCzech Republic                1.0380430  0.0678536  15.298  < 2e-16 ***
countryDenmark                      -0.1453713  0.2363134  -0.615 0.538453    
countryEcuador                       0.7787321  0.0873281   8.917  < 2e-16 ***
countryEl Salvador                   1.2168885  0.1079624  11.271  < 2e-16 ***
countryEstonia                       1.3377213  0.0804286  16.632  < 2e-16 ***
countryFiji                          0.6648421  0.1143058   5.816 6.10e-09 ***
countryFinland                       1.1244944  0.1084613  10.368  < 2e-16 ***
countryFrance                        1.5136200  0.0967305  15.648  < 2e-16 ***
countryGeorgia                      -0.0452623  0.0611317  -0.740 0.459062    
countryGermany                       1.0291344  0.1012936  10.160  < 2e-16 ***
countryGreece                        0.3030441  0.0645162   4.697 2.65e-06 ***
countryGrenada                       0.1267199  0.1205178   1.051 0.293058    
countryGuatemala                     0.4477438  0.0955004   4.688 2.77e-06 ***
countryGuyana                        1.7893650  0.1148696  15.577  < 2e-16 ***
countryHungary                       1.7441202  0.0602363  28.955  < 2e-16 ***
countryIceland                       0.4077317  0.1104902   3.690 0.000225 ***
countryIreland                       0.7051437  0.0730252   9.656  < 2e-16 ***
countryIsrael                        0.9734890  0.0857189  11.357  < 2e-16 ***
countryItaly                         0.7046088  0.0936313   7.525 5.45e-14 ***
countryJamaica                      -0.0101301  0.1181480  -0.086 0.931673    
countryJapan                         1.2482093  0.1285726   9.708  < 2e-16 ***
countryKazakhstan                    1.5140052  0.0723743  20.919  < 2e-16 ***
countryKiribati                      0.3689476  0.1264285   2.918 0.003524 ** 
countryKuwait                        0.3049219  0.1182300   2.579 0.009913 ** 
countryKyrgyzstan                    0.5131143  0.0838747   6.118 9.65e-10 ***
countryLatvia                        1.4503434  0.0770427  18.825  < 2e-16 ***
countryLithuania                     1.9300974  0.0745991  25.873  < 2e-16 ***
countryLuxembourg                    1.0266482  0.0814572  12.604  < 2e-16 ***
countryMalta                         0.3542329  0.0778002   4.553 5.31e-06 ***
countryMauritius                     1.2355644  0.0986426  12.526  < 2e-16 ***
countryMexico                        0.6190524  0.1082692   5.718 1.09e-08 ***
countryMontenegro                    0.0628834  0.0710042   0.886 0.375827    
countryNetherlands                   0.7821369  0.0747173  10.468  < 2e-16 ***
countryNew Zealand                   0.9719696  0.0684545  14.199  < 2e-16 ***
countryNicaragua                     0.9345654  0.1297956   7.200 6.20e-13 ***
countryNorway                        0.5667789  0.1191750   4.756 1.99e-06 ***
countryPanama                        0.9083036  0.1172339   7.748 9.74e-15 ***
countryParaguay                      0.5843253  0.0965207   6.054 1.44e-09 ***
countryPhilippines                   0.5446347  0.1403958   3.879 0.000105 ***
countryPoland                        0.9754330  0.0854166  11.420  < 2e-16 ***
countryPortugal                      0.8361846  0.0647570  12.913  < 2e-16 ***
countryPuerto Rico                   1.1769231  0.1157236  10.170  < 2e-16 ***
countryQatar                         0.4628545  0.1354286   3.418 0.000633 ***
countryRepublic of Korea             1.5946813  0.0819820  19.452  < 2e-16 ***
countryRomania                       0.7877192  0.0674154  11.685  < 2e-16 ***
countryRussian Federation            0.9644425  0.1725238   5.590 2.29e-08 ***
countrySaint Lucia                   0.7036431  0.1198058   5.873 4.33e-09 ***
countrySaint Vincent and Grenadines  0.4564925  0.1206784   3.783 0.000156 ***
countrySerbia                        1.2614840  0.0600549  21.006  < 2e-16 ***
countrySeychelles                    0.6373098  0.1227251   5.193 2.09e-07 ***
countrySingapore                     1.7707911  0.1246566  14.205  < 2e-16 ***
countrySlovakia                      0.6093188  0.0689628   8.835  < 2e-16 ***
countrySlovenia                      1.5529088  0.0690658  22.484  < 2e-16 ***
countrySouth Africa                  0.0358944  0.0877501   0.409 0.682504    
countrySpain                         0.8377337  0.0851445   9.839  < 2e-16 ***
countrySri Lanka                     2.4989936  0.1281932  19.494  < 2e-16 ***
countrySuriname                      1.8225122  0.1160746  15.701  < 2e-16 ***
countrySweden                        0.7800047  0.1037037   7.521 5.62e-14 ***
countrySwitzerland                   1.1669437  0.0884640  13.191  < 2e-16 ***
countryThailand                      1.1080801  0.1337045   8.288  < 2e-16 ***
countryTrinidad and Tobago           1.5051819  0.1173736  12.824  < 2e-16 ***
countryTurkey                        0.1436625  0.1011627   1.420 0.155588    
countryTurkmenistan                  0.6481341  0.0550257  11.779  < 2e-16 ***
countryUkraine                       1.4168806  0.0818600  17.309  < 2e-16 ***
countryUnited Arab Emirates          0.5240639  0.1494837   3.506 0.000456 ***
countryUnited Kingdom                0.5120429  0.1003482   5.103 3.38e-07 ***
countryUnited States                 0.6277121  0.1347876   4.657 3.23e-06 ***
countryUruguay                       1.4272601  0.0675744  21.121  < 2e-16 ***
countryUzbekistan                    0.5540797  0.0569200   9.734  < 2e-16 ***
sexMale                              0.9933181  0.0074531 133.276  < 2e-16 ***
age.L                                0.4277256  0.0083328  51.330  < 2e-16 ***
age.Q                               -0.0084772  0.0083328  -1.017 0.309009    
age.C                                0.0660510  0.0083328   7.927 2.35e-15 ***
age^4                                0.0441416  0.0083328   5.297 1.19e-07 ***
continentAmericas                           NA         NA      NA       NA    
continentAsia                               NA         NA      NA       NA    
continentEurope                             NA         NA      NA       NA    
continentOceania                            NA         NA      NA       NA    
population_bine_jenks.L              0.2040991  0.0820918   2.486 0.012918 *  
population_bine_jenks.Q              0.2142950  0.0535849   3.999 6.38e-05 ***
population_bine_jenks.C              0.0181399  0.0306318   0.592 0.553728    
scaled_log_GDP_capita               -0.0797017  0.0184584  -4.318 1.58e-05 ***
min_temp_bine_jenks.L                0.0181663  0.0208988   0.869 0.384718    
min_temp_bine_jenks.Q               -0.0036666  0.0120653  -0.304 0.761208    
gdp_per_capita_bine_jenks.L          0.0367725  0.0326685   1.126 0.260336    
gdp_per_capita_bine_jenks.Q         -0.0152593  0.0180188  -0.847 0.397086    
gdp_per_capita_bine_jenks.C          0.0287815  0.0136993   2.101 0.035657 *  
scaled_avg_temp                     -0.3927445  0.0695063  -5.650 1.62e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5628 on 22702 degrees of freedom
Multiple R-squared:  0.6847,    Adjusted R-squared:  0.6832 
F-statistic: 460.8 on 107 and 22702 DF,  p-value: < 2.2e-16

Before we delve deeper into our linear regression analysis, it’s crucial to emphasize that the assumptions underpinning this model don’t need to be flawlessly met. However, severe violations can skew the model’s accuracy and lead to misleading results.

Now, let’s turn our attention to evaluating other necessary conditions for our regression model.

plot(model_suicide_ratio,1)

plot(model_log_suicide_ratio,1)

plot(model_sqrt_suicide_ratio,1)

The linearity condition doesn’t appear to be perfectly satisfied for any of our targets. However, the residuals for the log-transformed target are reasonably well-distributed and do not demonstrate any discernible patterns. On the other hand, the suicide_ratio and sqrt_suicide_ratio targets, particularly the former, do display unusual patterns in the residuals graph. Let’s continue our evaluation by assessing the next assumption of our model.

plot(model_suicide_ratio,2)

plot(model_log_suicide_ratio,2)

plot(model_sqrt_suicide_ratio,2)

library(olsrr)
Registered S3 method overwritten by 'data.table':
  method           from
  print.data.table     

Attaching package: ‘olsrr’

The following object is masked from ‘package:datasets’:

    rivers
ols_plot_resid_hist(model_suicide_ratio)

ols_plot_resid_hist(model_log_suicide_ratio)

ols_plot_resid_hist(model_sqrt_suicide_ratio)

As observed, the majority of our residuals for the log_suicide_ratio and sqrt_suicide_ratio models lie within -1 and 1, and -1.5 and 1.5 respectively, and their distributions largely follow a normal pattern. However, the suicide_ratio model doesn’t appear to satisfy these conditions as effectively.

To evaluate the final assumption - homoscedasticity of residuals - we apply the Breusch-Pagan test. The test’s null hypothesis assumes homoscedasticity. If the p-value is significant (generally, less than 0.05), it suggests a deviation from this assumption, indicating heteroscedasticity.

library(lmtest)
Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric
bptest(model_suicide_ratio)

    studentized Breusch-Pagan test

data:  model_suicide_ratio
BP = 2875.6, df = 107, p-value < 2.2e-16
bptest(model_log_suicide_ratio)

    studentized Breusch-Pagan test

data:  model_log_suicide_ratio
BP = 3430.8, df = 107, p-value < 2.2e-16
bptest(model_sqrt_suicide_ratio)

    studentized Breusch-Pagan test

data:  model_sqrt_suicide_ratio
BP = 4770, df = 107, p-value < 2.2e-16
plot(model_log_suicide_ratio,3)

plot(model_sqrt_suicide_ratio,3)

The plot visualizes the residuals’ variance in relation to the predictors. Ideally, the residuals should be randomly scattered around the centerline, signifying homoscedasticity.

In our case, residuals are somewhat evenly distributed, indicating violation of homoscedasticity. This suggests our model probabiy be less accurate across the predictor range, but it does not drastically impact the overall model’s reliability. we can conclude that suicide_rate not a good candidate for target value.

4.3.1 Feature selection

Feature selection, also known as variable selection, attribute selection, or variable subset selection, is the process of selecting a subset of relevant features for use in model construction. The goal of feature selection is three-fold:

*Improving Model Performance: When irrelevant or partially relevant features are used to construct a predictive model, the accuracy of the model can be significantly degraded. By selecting only the most relevant features to use in model construction, we can enhance the predictive accuracy of the model.

*Reducing Overfitting: Too many features in the model can lead to overfitting, where the model performs well on the training data but poorly on unseen data. By reducing the number of features, we can make the model more generalizable.

*Enhancing Interpretability: Models with fewer features are simpler and easier to interpret.

Reducing Training Time: Fewer features mean faster training times.

Feature selection methods are intended to reduce the number of input variables to those that are believed to be most useful to a model in order to predict the target variable. Not all features are created equal. Some are relevant to the target variable, some are irrelevant, and some are redundant. Feature selection enables us to focus on the relevant and non-redundant features, increasing our model’s performance and interpretability.

Feature selection methods:

Forward Selection: You start with an empty model and add predictors one by one. In each step, you add the variable that gives the most significant improvement to the model.

Backward Selection: You start with the full model and remove predictors one by one. In each step, you remove the variable that is the least significant.

Mixed Selection: This is a combination of forward and backward selection. You start with an empty model, add variables as in forward selection, but after adding each new variable, the method may also remove variables that do not contribute to the model fit. in this project we use criteria like RSS, adjr2, Mallow’s Cp (cp) and Bayesian Information Criterion (BIC). Residual Sum of Squares (RSS): This is a measure of the discrepancy between the data and an estimation model. A small RSS indicates a tight fit of the model to the data.

Adjusted R-squared (adjr2): It is a modification of R-squared that adjusts for the number of predictors in the model. Unlike R-squared, the adjusted R-squared increases only if the new term enhances the model more than would be expected by chance.

Mallow’s Cp (cp): This criterion attempts to identify a model with a balance between under-fitting and over-fitting. Its ideal value is p (the number of predictors in the model), and a good model is a model where Cp is nearly equal to its p-value.

Bayesian Information Criterion (BIC): This criterion deals with model selection problems. Lower BIC means better model.

log_suicide_ratio

library(leaps)

# First, fit a full model
full_model <- model_log_suicide_ratio

#log_suicide_ratio
formula <- as.formula(paste("scaled_log_suicide_ratio", "~", paste(scaled_log_suicide_ratio_var, collapse = " + ")))



# Forward Selection
forward_model_log <- regsubsets(formula, data = data2, nvmax = length(data2)-1, method = "forward")
Warning: 5  linear dependencies found
Reordering variables and trying again:
forward_summary <- summary(forward_model_log)

# Backward Selection
backward_model_log <- regsubsets(formula, data = data2, nvmax = length(data2)-1, method = "backward")
Warning: 5  linear dependencies found
Reordering variables and trying again:
backward_summary <- summary(backward_model_log)

# Mixed (stepwise) selection
stepwise_model_log <- regsubsets(formula, data = data2, nvmax = length(data2)-1, method = "seqrep")
Warning: 5  linear dependencies found
Reordering variables and trying again:
stepwise_summary <- summary(stepwise_model_log)

# Create a dataframe with the criteria for each method
comparison_df_log <- data.frame(
  Method = c("Forward", "Backward", "Mixed"),
  RSS = c(forward_summary$rss[which.min(forward_summary$cp)], 
          backward_summary$rss[which.min(backward_summary$cp)], 
          stepwise_summary$rss[which.min(stepwise_summary$cp)]),
  AdjustedR2 = c(max(forward_summary$adjr2), max(backward_summary$adjr2), max(stepwise_summary$adjr2)),
  Cp = c(min(forward_summary$cp), min(backward_summary$cp), min(stepwise_summary$cp)),
  BIC = c(min(forward_summary$bic), min(backward_summary$bic), min(stepwise_summary$bic))
)


print(comparison_df_log)
NA

forward has higher adjustedR2 and lower BIC for log.

library(leaps)

# First, fit a full model
full_model <- model_sqrt_suicide_ratio

#log_suicide_ratio
formula <- as.formula(paste("scaled_sqrt_suicide_ratio", "~", paste(scaled_sqrt_suicide_ratio_var, collapse = " + ")))

  

# Forward Selection
forward_model_sqrt <- regsubsets(formula, data = data2, nvmax = length(data2)-1, method = "forward")
Warning: 5  linear dependencies found
Reordering variables and trying again:
forward_summary <- summary(forward_model_sqrt)

# Backward Selection
backward_model_sqrt <- regsubsets(formula, data = data2, nvmax = length(data2)-1, method = "backward")
Warning: 5  linear dependencies found
Reordering variables and trying again:
backward_summary <- summary(backward_model_sqrt)

# Mixed (stepwise) selection
stepwise_model_sqrt <- regsubsets(formula, data = data2, nvmax = length(data2)-1, method = "seqrep")
Warning: 5  linear dependencies found
Reordering variables and trying again:
stepwise_summary <- summary(stepwise_model_sqrt)

# Create a dataframe with the criteria for each method
comparison_df_sqrt <- data.frame(
  Method = c("Forward", "Backward", "Mixed"),
  RSS = c(forward_summary$rss[which.min(forward_summary$cp)], 
          backward_summary$rss[which.min(backward_summary$cp)], 
          stepwise_summary$rss[which.min(stepwise_summary$cp)]),
  AdjustedR2 = c(max(forward_summary$adjr2), max(backward_summary$adjr2), max(stepwise_summary$adjr2)),
  Cp = c(min(forward_summary$cp), min(backward_summary$cp), min(stepwise_summary$cp)),
  BIC = c(min(forward_summary$bic), min(backward_summary$bic), min(stepwise_summary$bic))
)


print(comparison_df_sqrt)
NA

forward has higher adjustedR2 and lower BIC for both log and sqrt but its much lesser than original model

var1<-c(scaled_log_suicide_ratio_var,"scaled_log_suicide_ratio")
data3<-data2 %>% select(one_of(var1))
trainData_log <- data3 %>% filter(year <=2010)
testData_log <- data3 %>% filter(year >2010)
var1<-c(scaled_sqrt_suicide_ratio_var,"scaled_sqrt_suicide_ratio")
data3<-data2 %>% select(one_of(var1))
trainData_sqrt <- data3 %>% filter(year <=2010)
testData_sqrt <- data3 %>% filter(year >2010)

just for seeing how much our works on data is been influential on performance of model we inspect initial data performance on models too

initial_var<-c("country","year","sex","age","suicide_ratio","GDP_for_year","GDP_per_capita","generation","continent","life_exp","avg_temp","max_temp","min_temp")
data3<-data %>% select(one_of(initial_var))
trainData_initial <- data3 %>% filter(year <=2010)
testData_initial <- data3 %>% filter(year >2010)

4.4.1 Simple linear model

first we train a model for target log

library(Metrics)
formula <- as.formula(paste("scaled_log_suicide_ratio", "~", paste(scaled_log_suicide_ratio_var, collapse = " + ")))

model <- lm(formula, data = trainData_log)
data_test <- testData_log
# Make predictions on the testing data
predictions <- predict(model, newdata = testData_log)
Warning: prediction from a rank-deficient fit may be misleading
target <-"scaled_log_suicide_ratio"
# Calculate Mean Squared Error (MSE)
mse <- mse(testData_log$scaled_log_suicide_ratio, predictions)
print(paste0("MSE: ", mse))
[1] "MSE: 0.259606519647617"
# Calculate R-squared
sse = sum((predictions - testData_log$scaled_log_suicide_ratio)^2)
sst = sum((testData_log$scaled_log_suicide_ratio - mean(testData_log$scaled_log_suicide_ratio))^2)
r_squared = 1 - sse / sst
print(paste0("R-squared: ", r_squared))
[1] "R-squared: 0.713841021873424"
# Calculate Adjusted R-squared
n = length(testData_log$scaled_log_suicide_ratio) # number of observations
p = length(coef(model)) - 1 # number of predictors
adjusted_r_squared = 1 - (1 - r_squared) * ((n - 1) / (n - p - 1))
print(paste0("Adjusted R-squared: ", adjusted_r_squared))
[1] "Adjusted R-squared: 0.705274774146334"
library(Metrics)
formula <- as.formula(paste("scaled_sqrt_suicide_ratio", "~", paste(scaled_sqrt_suicide_ratio_var, collapse = " + ")))

model <- lm(formula, data = trainData_sqrt)

# Make predictions on the testing data
predictions <- predict(model, newdata = testData_sqrt)
Warning: prediction from a rank-deficient fit may be misleading
# Calculate Mean Squared Error (MSE)
mse <- mse(testData_sqrt$scaled_sqrt_suicide_ratio, predictions)
print(paste0("MSE: ", mse))
[1] "MSE: 0.278237222189494"
# Calculate R-squared
sse = sum((predictions - testData_sqrt$scaled_sqrt_suicide_ratio)^2)
sst = sum((testData_sqrt$scaled_sqrt_suicide_ratio - mean(testData_sqrt$scaled_sqrt_suicide_ratio))^2)
r_squared = 1 - sse / sst
print(paste0("R-squared: ", r_squared))
[1] "R-squared: 0.657814425097907"
# Calculate Adjusted R-squared
n = length(testData_sqrt$scaled_sqrt_suicide_ratio) # number of observations
p = length(coef(model)) - 1 # number of predictors
adjusted_r_squared = 1 - (1 - r_squared) * ((n - 1) / (n - p - 1))
print(paste0("Adjusted R-squared: ", adjusted_r_squared))
[1] "Adjusted R-squared: 0.647571005784495"

initial_data

library(Metrics)
a <-c("country","year","sex","age","GDP_for_year","GDP_per_capita","generation","continent","life_exp","avg_temp","max_temp","min_temp")
formula <- as.formula(paste("suicide_ratio", "~", paste(a, collapse = " + ")))

model <- lm(formula, data = trainData_initial)

# Make predictions on the testing data
predictions <- predict(model, newdata = testData_initial)
Warning: prediction from a rank-deficient fit may be misleading
# Calculate Mean Squared Error (MSE)
mse <- mse(testData_initial$suicide_ratio, predictions)
print(paste0("MSE: ", mse))
[1] "MSE: 143.188698245645"
# Calculate R-squared
sse = sum((predictions - testData_initial$suicide_ratio)^2)
sst = sum((testData_initial$suicide_ratio - mean(testData_initial$suicide_ratio))^2)
r_squared = 1 - sse / sst
print(paste0("R-squared: ", r_squared))
[1] "R-squared: 0.485732100309933"
# Calculate Adjusted R-squared
n = length(testData_initial$suicide_ratio) # number of observations
p = length(coef(model)) - 1 # number of predictors
adjusted_r_squared = 1 - (1 - r_squared) * ((n - 1) / (n - p - 1))
print(paste0("Adjusted R-squared: ", adjusted_r_squared))
[1] "Adjusted R-squared: 0.470337349267431"

it shows that our hard work was effective ### 4.4.2 Lasso regression model Lasso regression is a type of regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces. The term Lasso is an acronym for Least Absolute Shrinkage and Selection Operator.

The Lasso method introduces a penalty term to the loss function of the linear regression model that is the absolute value of the magnitude of the coefficient values, or simply the absolute value of each coefficient. lasso for log target


library(caret)
Loading required package: lattice

Attaching package: ‘caret’

The following objects are masked from ‘package:Metrics’:

    precision, recall

The following object is masked from ‘package:purrr’:

    lift
library(glmnet)
Loading required package: Matrix

Attaching package: ‘Matrix’

The following objects are masked from ‘package:tidyr’:

    expand, pack, unpack

Loaded glmnet 4.1-7
x_train <- model.matrix(scaled_log_suicide_ratio~., trainData_log)[,-1] # Exclude intercept column
y_train <- trainData_log$scaled_log_suicide_ratio
x_test <- model.matrix(scaled_log_suicide_ratio~., testData_log)[,-1] # Exclude intercept column
y_test <- testData_log$scaled_log_suicide_ratio

# Define the cross-validation experiment
cvfit <- cv.glmnet(x_train, y_train, alpha = 1, type.measure = "mse")

# Get the optimal lambda value
lambda_optimal <- cvfit$lambda.min

# Train the final model using the optimal lambda
final_model <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_optimal)


predictions <- predict(final_model, s = lambda_optimal, newx = x_test)

# Evaluate the performance
mse <- mean((predictions - y_test)^2)

print(paste("MSE on the test set: ", mse))
[1] "MSE on the test set:  0.260504473396823"
rsq <- 1 - sum((predictions - y_test)^2) / sum((mean(y_test) - y_test)^2)
print(paste("RSQ test :",rsq))
[1] "RSQ test : 0.712851225748016"
# Calculate adjusted R-squared
n <- length(y_test) # number of observations
p <- coef(final_model, s = "lambda.min") # number of predictors
adj_rsq <- 1 - (1 - rsq) * (n - 1) / (n - length(p) - 1)

print(paste("Adjuster R squre :",adj_rsq))
[1] "Adjuster R squre : 0.704095745043625"

for sqre target


library(caret)
library(glmnet)


x_train <- model.matrix(scaled_sqrt_suicide_ratio~., trainData_sqrt)[,-1] # Exclude intercept column
y_train <- trainData_sqrt$scaled_sqrt_suicide_ratio
x_test <- model.matrix(scaled_sqrt_suicide_ratio~., testData_sqrt)[,-1] # Exclude intercept column
y_test <- testData_sqrt$scaled_sqrt_suicide_ratio

# Define the cross-validation experiment
cvfit <- cv.glmnet(x_train, y_train, alpha = 1, type.measure = "mse")

# Get the optimal lambda value
lambda_optimal <- cvfit$lambda.min

# Train the final model using the optimal lambda
final_model <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_optimal)


predictions <- predict(final_model, s = lambda_optimal, newx = x_test)

# Evaluate the performance
mse <- mean((predictions - y_test)^2)

print(paste("MSE on the test set: ", mse))
[1] "MSE on the test set:  0.279968871779737"
rsq <- 1 - sum((predictions - y_test)^2) / sum((mean(y_test) - y_test)^2)
print(paste("RSQ test :",rsq))
[1] "RSQ test : 0.655684783686511"
# Calculate adjusted R-squared
n <- length(y_test) # number of observations
p <- coef(final_model, s = "lambda.min") # number of predictors
adj_rsq <- 1 - (1 - rsq) * (n - 1) / (n - length(p) - 1)

print(paste("Adjuster R squre :",adj_rsq))
[1] "Adjuster R squre : 0.64518623553664"

4.4.3 Ridge regression

Ridge regression, also known as Tikhonov regularization, is a regularization technique designed to deal with multicollinearity, improve prediction accuracy, and interpretability of the statistical model it is applied to. Ridge regression performs “L2 regularization,” which means that it adds a penalty equivalent to the square of the magnitude of the coefficients. This results in smaller coefficients, which makes the model less complex and better at generalizing from the training data to unseen data.

for log target


library(caret)
library(glmnet)


x_train <- model.matrix(scaled_log_suicide_ratio~., trainData_log)[,-1] # Exclude intercept column
y_train <- trainData_log$scaled_log_suicide_ratio
x_test <- model.matrix(scaled_log_suicide_ratio~., testData_log)[,-1] # Exclude intercept column
y_test <- testData_log$scaled_log_suicide_ratio

# Define the cross-validation experiment
cvfit <- cv.glmnet(x_train, y_train, alpha = 0, type.measure = "mse")

# Get the optimal lambda value
lambda_optimal <- cvfit$lambda.min

# Train the final model using the optimal lambda
final_model <- glmnet(x_train, y_train, alpha = 0, lambda = lambda_optimal)


predictions <- predict(final_model, s = lambda_optimal, newx = x_test)

# Evaluate the performance
rmse <-mean((predictions - y_test)^2)

print(paste("MSE on the test set: ", rmse))
[1] "MSE on the test set:  0.261420531902801"
rsq <- 1 - sum((predictions - y_test)^2) / sum((mean(y_test) - y_test)^2)
print(paste("RSQ test :",rsq))
[1] "RSQ test : 0.711841473118035"
# Calculate adjusted R-squared
n <- length(y_test) # number of observations
p <- coef(final_model, s = "lambda.min") # number of predictors
adj_rsq <- 1 - (1 - rsq) * (n - 1) / (n - length(p) - 1)

print(paste("Adjuster R squre :",adj_rsq))
[1] "Adjuster R squre : 0.70305520394975"

for sqrt target

x_train <- model.matrix(scaled_sqrt_suicide_ratio~., trainData_sqrt)[,-1] # Exclude intercept column
y_train <- trainData_sqrt$scaled_sqrt_suicide_ratio
x_test <- model.matrix(scaled_sqrt_suicide_ratio~., testData_sqrt)[,-1] # Exclude intercept column
y_test <- testData_sqrt$scaled_sqrt_suicide_ratio

# Define the cross-validation experiment
cvfit <- cv.glmnet(x_train, y_train, alpha = 0, type.measure = "mse")

# Get the optimal lambda value
lambda_optimal <- cvfit$lambda.min

# Train the final model using the optimal lambda
final_model <- glmnet(x_train, y_train, alpha = 0, lambda = lambda_optimal)


predictions <- predict(final_model, s = lambda_optimal, newx = x_test)

# Evaluate the performance
rmse <- mean((predictions - y_test)^2)

print(paste("MSE on the test set: ", rmse))
[1] "MSE on the test set:  0.275760521328236"
rsq <- 1 - sum((predictions - y_test)^2) / sum((mean(y_test) - y_test)^2)
print(paste("RSQ test :",rsq))
[1] "RSQ test : 0.660860355837873"
# Calculate adjusted R-squared
n <- length(y_test) # number of observations
p <- coef(final_model, s = "lambda.min") # number of predictors
adj_rsq <- 1 - (1 - rsq) * (n - 1) / (n - length(p) - 1)

print(paste("Adjuster R squre :",adj_rsq))
[1] "Adjuster R squre : 0.650519616552844"
vec1 <-c("Simple Linear",0.2596,0.7138,0.7052,"log_suicide_ratio")
vec2<-c("Simple Linear",0.278,0.6578,0.6475,"sqrt_suicide_ratio")
vec3<-c("Simple Linear",143.18,0.485,0.4703,"suicide_ratio")
vec4<-c("Lasso",0.2604,0.7129,0.7042,"log_suicide_ratio")
vec5<-c("Lasso",0.2798,0.655,0.6453,"sqrt_suicide_ratio")
vec6<-c("Ridge",0.2615,"0.7116","0.7028","log_suicide_ratio")
vec7<- c("Ridge",0.2758,0.6607,0.6503,"sqrt_suicide_ratio")

5.Conclusions

df <- data.frame(Model=rep(NA,7), 
                 Mean_Squared_Error=rep(NA,7), 
                 R_squared=rep(NA,7), 
                 Adjusted_R_squared=rep(NA,7), 
                 Target=rep(NA,7))
df[1, ] <- vec1
df[2, ] <- vec2
df[3, ] <- vec3
df[4, ] <- vec4
df[5, ] <- vec5
df[6, ] <- vec6
df[7, ] <- vec7
df

The simple linear regression model with log as the target slightly outperforms the others. However, simple linear, Lasso, and Ridge regressions with log as the target demonstrated quite similar performance. To distinguish more effectively between these models, we should employ cross-validation techniques.

Now, let’s explore the importance of each feature in the simple linear regression model, where the target is ‘log_suicide_ratio’. This will give us more insight into the significant predictors in our model.

library(Metrics)
formula <- as.formula(paste("scaled_log_suicide_ratio", "~", paste(scaled_log_suicide_ratio_var, collapse = " + ")))

model <- lm(formula, data = trainData_log)
data_test <- testData_log
# Make predictions on the testing data
predictions <- predict(model, newdata = testData_log)
Warning: prediction from a rank-deficient fit may be misleading
target <-"scaled_log_suicide_ratio"
# Calculate Mean Squared Error (MSE)
mse <- mse(testData_log$scaled_log_suicide_ratio, predictions)
print(paste0("MSE: ", mse))
[1] "MSE: 0.259606519647617"
# Calculate R-squared
sse = sum((predictions - testData_log$scaled_log_suicide_ratio)^2)
sst = sum((testData_log$scaled_log_suicide_ratio - mean(testData_log$scaled_log_suicide_ratio))^2)
r_squared = 1 - sse / sst
print(paste0("R-squared: ", r_squared))
[1] "R-squared: 0.713841021873424"
# Calculate Adjusted R-squared
n = length(testData_log$scaled_log_suicide_ratio) # number of observations
p = length(coef(model)) - 1 # number of predictors
adjusted_r_squared = 1 - (1 - r_squared) * ((n - 1) / (n - p - 1))
print(paste0("Adjusted R-squared: ", adjusted_r_squared))
[1] "Adjusted R-squared: 0.705274774146334"
summary(model)

Call:
lm(formula = formula, data = trainData_log)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8964 -0.2737  0.0349  0.2937  2.5030 

Coefficients: (5 not defined because of singularities)
                                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)                          8.6339637  1.9861361   4.347 1.39e-05 ***
year                                -0.0051372  0.0009847  -5.217 1.84e-07 ***
countryAntigua and Barbuda           2.0701317  0.1211419  17.088  < 2e-16 ***
countryArgentina                     1.0611908  0.0789824  13.436  < 2e-16 ***
countryArmenia                      -0.2411279  0.0567482  -4.249 2.16e-05 ***
countryAruba                         2.2125428  0.1356740  16.308  < 2e-16 ***
countryAustralia                     1.3983657  0.0959024  14.581  < 2e-16 ***
countryAustria                       1.4884649  0.0820656  18.137  < 2e-16 ***
countryAzerbaijan                   -0.6248443  0.0518645 -12.048  < 2e-16 ***
countryBahamas                       1.0835249  0.1160726   9.335  < 2e-16 ***
countryBahrain                       1.0136539  0.1238993   8.181 2.99e-16 ***
countryBarbados                      1.1443311  0.1220616   9.375  < 2e-16 ***
countryBelarus                       1.6274245  0.0688722  23.630  < 2e-16 ***
countryBelgium                       1.5312773  0.0699516  21.891  < 2e-16 ***
countryBelize                        1.6748394  0.1085822  15.425  < 2e-16 ***
countryBrazil                        0.6348542  0.2115340   3.001 0.002693 ** 
countryBulgaria                      1.2786532  0.0489592  26.117  < 2e-16 ***
countryCanada                        0.8728859  0.1602775   5.446 5.21e-08 ***
countryChile                         0.6725118  0.0552855  12.164  < 2e-16 ***
countryColombia                      0.6409490  0.1201195   5.336 9.62e-08 ***
countryCosta Rica                    0.8822440  0.1126480   7.832 5.06e-15 ***
countryCroatia                       1.5704023  0.0563754  27.856  < 2e-16 ***
countryCuba                          1.8129338  0.1115448  16.253  < 2e-16 ***
countryCyprus                        0.3776505  0.0901405   4.190 2.81e-05 ***
countryCzech Republic                1.2561592  0.0643312  19.526  < 2e-16 ***
countryDenmark                       0.5409765  0.2418171   2.237 0.025289 *  
countryEcuador                       0.7367235  0.0865224   8.515  < 2e-16 ***
countryEl Salvador                   1.2255700  0.1079683  11.351  < 2e-16 ***
countryEstonia                       1.6255610  0.0788501  20.616  < 2e-16 ***
countryFiji                          1.0003509  0.1138688   8.785  < 2e-16 ***
countryFinland                       1.4448523  0.1079861  13.380  < 2e-16 ***
countryFrance                        1.6970096  0.1047866  16.195  < 2e-16 ***
countryGeorgia                      -0.1103127  0.0587077  -1.879 0.060258 .  
countryGermany                       1.3118352  0.1095124  11.979  < 2e-16 ***
countryGreece                        0.0775424  0.0609030   1.273 0.202958    
countryGrenada                       1.9601590  0.1217486  16.100  < 2e-16 ***
countryGuatemala                     0.1102693  0.0947247   1.164 0.244397    
countryGuyana                        1.9571851  0.1148735  17.038  < 2e-16 ***
countryHungary                       1.7938440  0.0569404  31.504  < 2e-16 ***
countryIceland                       1.2192789  0.1095099  11.134  < 2e-16 ***
countryIreland                       0.9163185  0.0683086  13.414  < 2e-16 ***
countryIsrael                        0.9942509  0.0837859  11.867  < 2e-16 ***
countryItaly                         0.8276146  0.1030944   8.028 1.05e-15 ***
countryJamaica                      -0.7837046  0.1175467  -6.667 2.68e-11 ***
countryJapan                         1.5921159  0.1856974   8.574  < 2e-16 ***
countryKazakhstan                    1.7129881  0.0707859  24.200  < 2e-16 ***
countryKiribati                      2.4574776  0.1241918  19.788  < 2e-16 ***
countryKuwait                        0.4778700  0.1182609   4.041 5.35e-05 ***
countryKyrgyzstan                    0.9504617  0.0840938  11.302  < 2e-16 ***
countryLatvia                        1.6583482  0.0752545  22.037  < 2e-16 ***
countryLithuania                     1.9585877  0.0724760  27.024  < 2e-16 ***
countryLuxembourg                    1.4596508  0.0773800  18.863  < 2e-16 ***
countryMalta                         0.9363929  0.0753984  12.419  < 2e-16 ***
countryMauritius                     1.5074685  0.0983495  15.328  < 2e-16 ***
countryMexico                        0.3567201  0.1164615   3.063 0.002194 ** 
countryMontenegro                    0.4716294  0.0624739   7.549 4.58e-14 ***
countryNetherlands                   0.9901728  0.0704670  14.052  < 2e-16 ***
countryNew Zealand                   1.2284615  0.0635832  19.321  < 2e-16 ***
countryNicaragua                     0.9602198  0.1880365   5.107 3.31e-07 ***
countryNorway                        0.9817333  0.1183265   8.297  < 2e-16 ***
countryPanama                        0.8308842  0.1181729   7.031 2.12e-12 ***
countryParaguay                      0.4410477  0.0962993   4.580 4.68e-06 ***
countryPhilippines                   0.0931614  0.1477824   0.630 0.528443    
countryPoland                        1.2150773  0.0869009  13.982  < 2e-16 ***
countryPortugal                      0.8521113  0.0605372  14.076  < 2e-16 ***
countryPuerto Rico                   0.9889828  0.1165556   8.485  < 2e-16 ***
countryQatar                         1.0969368  0.1374820   7.979 1.56e-15 ***
countryRepublic of Korea             1.6383170  0.0823173  19.902  < 2e-16 ***
countryRomania                       1.0464677  0.0714912  14.638  < 2e-16 ***
countryRussian Federation            1.5026077  0.2184321   6.879 6.21e-12 ***
countrySaint Lucia                   1.8441775  0.1207396  15.274  < 2e-16 ***
countrySaint Vincent and Grenadines  2.0450841  0.1219296  16.773  < 2e-16 ***
countrySerbia                        1.4364946  0.0583010  24.639  < 2e-16 ***
countrySeychelles                    2.2898435  0.1246346  18.372  < 2e-16 ***
countrySingapore                     1.7089927  0.1255121  13.616  < 2e-16 ***
countrySlovakia                      0.6189721  0.0653224   9.476  < 2e-16 ***
countrySlovenia                      1.7136539  0.0664696  25.781  < 2e-16 ***
countrySouth Africa                 -0.8752966  0.0889033  -9.845  < 2e-16 ***
countrySpain                         0.9397352  0.0855540  10.984  < 2e-16 ***
countrySri Lanka                     2.2730792  0.1260424  18.034  < 2e-16 ***
countrySuriname                      2.0132034  0.1165673  17.271  < 2e-16 ***
countrySweden                        1.1350470  0.1022006  11.106  < 2e-16 ***
countrySwitzerland                   1.4597035  0.0856385  17.045  < 2e-16 ***
countryThailand                      1.1267797  0.1421507   7.927 2.38e-15 ***
countryTrinidad and Tobago           1.6136069  0.1173823  13.747  < 2e-16 ***
countryTurkey                       -0.1398765  0.1418454  -0.986 0.324087    
countryTurkmenistan                  0.9692461  0.0502468  19.290  < 2e-16 ***
countryUkraine                       1.6383462  0.0857071  19.116  < 2e-16 ***
countryUnited Arab Emirates          0.3837009  0.1467405   2.615 0.008934 ** 
countryUnited Kingdom                0.7563182  0.1081298   6.995 2.75e-12 ***
countryUnited States                 0.9927339  0.1865643   5.321 1.04e-07 ***
countryUruguay                       1.4711658  0.0644911  22.812  < 2e-16 ***
countryUzbekistan                    0.7692677  0.0514785  14.943  < 2e-16 ***
sexMale                              1.0290630  0.0070749 145.453  < 2e-16 ***
age.L                                0.6570340  0.0079100  83.064  < 2e-16 ***
age.Q                                0.1526387  0.0079100  19.297  < 2e-16 ***
age.C                                0.1283105  0.0079100  16.221  < 2e-16 ***
age^4                                0.0089594  0.0079100   1.133 0.257367    
continentAmericas                           NA         NA      NA       NA    
continentAsia                               NA         NA      NA       NA    
continentEurope                             NA         NA      NA       NA    
continentOceania                            NA         NA      NA       NA    
population_bine_jenks.L              0.0330934  0.1321995   0.250 0.802335    
population_bine_jenks.Q              0.1695602  0.0827496   2.049 0.040468 *  
population_bine_jenks.C                     NA         NA      NA       NA    
scaled_log_GDP_capita               -0.0605127  0.0181248  -3.339 0.000843 ***
min_temp_bine_jenks.L                0.0011148  0.0201592   0.055 0.955898    
min_temp_bine_jenks.Q               -0.0054622  0.0116370  -0.469 0.638801    
gdp_per_capita_bine_jenks.L          0.0617656  0.0315224   1.959 0.050079 .  
gdp_per_capita_bine_jenks.Q         -0.0241925  0.0174193  -1.389 0.164900    
gdp_per_capita_bine_jenks.C          0.0089384  0.0130658   0.684 0.493914    
scaled_avg_temp                     -0.2252791  0.0713174  -3.159 0.001587 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4875 on 18883 degrees of freedom
Multiple R-squared:  0.7674,    Adjusted R-squared:  0.7661 
F-statistic: 587.6 on 106 and 18883 DF,  p-value: < 2.2e-16

as we can see the most important feature in order are:

  1. sex
  2. country
  3. age
  4. scaled_avg_temp
  5. scaled_log_GDP_capita
  6. population_bine_jenks
---
title: "Statistical Learning"
author: "<i>Amin Almasi, Asma Hoseinpour </i>"
output:
  html_notebook:
    toc: yes
    toc_float: true
  html_document:
    toc: yes
    toc_float: true
    df_print: paged
    highlight: tango
---
```{r}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)

```

```{r}
library(tidyverse) # general
library(ggalt) # dumbbell plots
library(grid) # plots
library(gridExtra) # plots
library(ggcorrplot) 
library(ggplot2)
library(rworldmap) # quick country-level heat map
library(countrycode) # continent
library(broom) # significant trends within countries
library(lubridate) # dealing with dates
library(car)
library(scales)
```

# 1.Obtaining Data 

## 1.1 Suicide Data

The dataset was sourced from [links](https://www.kaggle.com/datasets/russellyates88/suicide-rates-overview-1985-to-2016?datasetId=85351&sortBy=voteCount&language=R) and contains a comprehensive collection of suicide statistics broken down by country, year, age, and sex. This data has been gathered from multiple datasets with the intention of identifying trends and patterns in global suicide rates. Here are the features of this dataset:

* **Country**: The country where the data was recorded.

* **Year**: The year when the data was recorded. The data was collect from 1985 until 2016.

* **Sex**: The sex of the individuals included in the suicide count.

* **Age**: The age group of the individuals included in the suicide count.

* **Suicide_no**: The total number of suicides recorded in a specific country, year, age group, and sex.

* **Population**: The total population of the specific age and sex group in the country for that year.

* **Suicide/ 100k pop**: This is a derived metric representing the number of suicides per 100,000 people in the population, calculated as (Suicide_no / Population) * 100000.

* **HDI_for_year**: Human Development Index, a statistic composite index of life expectancy, education, and per capita income indicators.

* **GDP_for_year ($)**: The gross domestic product (GDP) of the country for the specified year, measured in US dollars.

* **GDP_per_capita (\$)**: The GDP per capita of the country for the specified year, also measured in US dollars. It is calculated by dividing the GDP_for_year ($) by the total population of the country.

* **Generation**: The generation cohort of the individuals included in the suicide count (e.g., Gen X, Boomers, etc.).

```{r}
suicide_data <- read_csv("suicide_data.csv", show_col_types = FALSE)
head(suicide_data)
```
```{r}
dim(suicide_data)
```
We further supplemented our analysis by incorporating additional features, which we hypothesize may significantly impact suicide rates.

## 1.2 Continet

We've enriched our dataset by appending a 'continent' feature corresponding to each country. This enhancement, accomplished utilizing the 'countrycode' library, will facilitate subsequent geographical analyses.

```{r}
# getting continent data:
suicide_data$continent <- countrycode(sourcevar = suicide_data$country,
                              origin = "country.name",
                              destination = "continent")

dim(suicide_data)
```


## 1.3 Life Expectancy

We also added 'life expectancy' data from [links](https://databank.worldbank.org/). Remember, the Human Development Index (HDI) also measures life expectancy, so these two features might be highly correlated. We'll keep this in mind for our analysis.

```{r}
life_exp_data <-read_csv('life_exp.csv',show_col_types = FALSE)
head(life_exp_data)
```


```{r}
life_exp_data$Time <- as.integer(life_exp_data$Time)

life_exp_data <- life_exp_data %>%
  rename( life_exp = `Value`,
          year = `Time`,
          country = `Country Name`) %>%
  as.data.frame()
```

We've identified that the format of some country names differs between the suicide data and the life expectancy dataset. To prevent any further discrepancies, it's imperative that we address and rectify these inconsistencies.

```{r}
name_changes_life <- c("Bahamas, The" = "Bahamas", 
                  "Czechia" = "Czech Republic",
                  "Kyrgyz Republic" = "Kyrgyzstan",
                  "Macao SAR, China" = "Macau",
                  "Korea, Rep." = "Republic of Korea",
                  "St. Kitts and Nevis" = "Saint Kitts and Nevis",
                  "St. Lucia" = "Saint Lucia",
                  "St. Vincent and the Grenadines" = "Saint Vincent and Grenadines",
                  "Slovak Republic" = "Slovakia",
                  "Turkiye" = "Turkey"
                  )

life_exp_data$country[life_exp_data$country %in% names(name_changes_life)] <- 
    name_changes_life[life_exp_data$country[life_exp_data$country %in% names(name_changes_life)]]
```

```{r}
data <- suicide_data %>%
  left_join(life_exp_data[, c('year', 'country', 'life_exp')], 
            by = c('year', 'country'))
```


```{r}
dim(data)
```
## 1.4 Temperature

Our investigation suggests that temperature might influence suicide rates.(source: [links](https://annals-general-psychiatry.biomedcentral.com/articles/10.1186/s12991-016-0106-2#ref-CR47)) So, we incorporated temperature data into our dataset from [link](https://www.kaggle.com/datasets/berkeleyearth/climate-change-earth-surface-temperature-data?sort=votes) to further explore this possibility. This dataset covers temperature data since 1750 till 2013. We include three additional features:

* **max_temp**: The highest monthly temperature for each year.
* **min_temp**: The lowest monthly temperature for each year.
* **avg_temp**: The average monthly temperature for each year.

```{r}
temp_data <- read_csv("GlobalLandTemperaturesByCountry.csv",show_col_types = FALSE)
head(temp_data)
```

```{r}
# Extract year and month from date_column
temp_data$year <- year(temp_data$dt)
temp_data$month <- month(temp_data$dt)

temp_data <- select(temp_data, -dt, -"AverageTemperatureUncertainty")
temp_data<- rename(temp_data, country = Country)

temp_data$year <-as.integer(temp_data$year)
temp_data$month <-as.integer(temp_data$month)
temp_data$AverageTemperature <-as.numeric(temp_data$AverageTemperature)

temp_data <- temp_data %>%
  filter(year >= 1985 & year <= 2016)

temp_data <- temp_data %>%
  group_by(country, year) %>%
  mutate(
    avg_temp = sum(AverageTemperature, na.rm = TRUE) / sum(!is.na(AverageTemperature)),
    max_temp = max(AverageTemperature, na.rm = TRUE),
    min_temp = min(AverageTemperature, na.rm = TRUE)
  ) %>%
  ungroup()

temp_data <- select(temp_data, -month, -"AverageTemperature")

temp_data <-distinct(temp_data)

temp_data <- temp_data %>%
  filter_all(all_vars(!is.infinite(.)))

```

As with the life expectancy data, we've noticed discrepancies in the formatting of country names between the suicide data and temperature dataset. To avoid any potential issues down the line, it's crucial that we reconcile these inconsistencies and standardize the country names across all datasets.

```{r}
name_changes_temp <- c("Antigua And Barbuda" = "Antigua and Barbuda", 
                  "Bosnia And Herzegovina" = "Bosnia and Herzegovina",
                  "South Korea" = "Republic of Korea",
                  "Russia" = "Russian Federation",
                  "Saint Kitts And Nevis" = "Saint Kitts and Nevis",
                  "Trinidad And Tobago" = "Trinidad and Tobago",
                  "Saint Vincent And The Grenadines" = "Saint Vincent and Grenadines"
                  )
temp_data$country[temp_data$country %in% names(name_changes_temp)] <- 
    name_changes_temp[temp_data$country[temp_data$country %in% names(name_changes_temp)]]
```

```{r}
data <- data %>%
  left_join(temp_data[, c('year', 'country', 'avg_temp', 'max_temp', 'min_temp')], 
            by = c('year', 'country'))
```

```{r}
dim(data)
```
Now that we've collected all the necessary data, our next step is to clean and preprocess this data for further analysis.


# 2.Clean and Filter Data

```{r}
glimpse(head(data, 8))
```
```{r}
print(colnames(data))
```

```{r}
sapply(data, function(x) length(unique(x)))
```
## 2.1 Columns and Values

```{r}
data <- data %>% 
  rename(suicide_ratio = `suicides/100k pop`, 
         country_year = `country-year`,
         HDI_for_year = `HDI for year`,
         GDP_for_year = `gdp_for_year ($)`, 
         GDP_per_capita = `gdp_per_capita ($)`) %>%
  as.data.frame()

data$age <- gsub(" years", "", data$age)

data$sex <- ifelse(data$sex == "male", "Male", "Female")
```

## 2.2 Missing Values

### 2.2.1 Data Scarcity by Country/Year

In an ideal dataset, every unique combination of country and year (country_year) would be represented by 12 entries (2 genders across 6 age groups). Now, we need to verify the completeness of our data for each country_year combination.

```{r}
data %>%
  group_by(country_year) %>%
  count() %>% #this SHOULD give 12 rows for every county-year combination (6 age bands * 2 gender)
  filter(n != 12)
```
It appears that there is problem with 2016 data.

```{r}
year_value_counts <- as.data.frame(sort(table(data$year), decreasing = FALSE))
names(year_value_counts) <- c("Year", "Count")
head(year_value_counts,5)
```
Our exploration reveals that the dataset for the year 2016 is not only sparse, but also incomplete for the few countries that have entries. Additionally, data for the years between 1985 and 1989 are also quite limited. These issues need to be addressed.
As a solution, we've decided to exclude the data from 2016. We also drop the 'country_year' column from our dataset.

```{r}
data <- data %>%
  filter(year != 2016) %>% # I therefore exclude 2016 data
  select(-country_year)
```

In the following step, we focus on filtering our dataset to ensure its robustness for further analysis. Specifically, we are addressing the issue of certain countries that have insufficient data spread across the years. These sparse data points can potentially skew our analysis or generate inaccurate insights. Therefore, we will systematically remove such countries from our dataset to maintain data integrity and reliability for subsequent steps in our study.

```{r}
minimum_years <- data %>%
  group_by(country) %>%
  summarize(rows = n(), 
            years = rows / 12) %>%
  arrange(years)

minimum_years <- minimum_years %>%
  filter(minimum_years$years<=3)
  

data <- data %>%
filter(!(country %in% minimum_years$country))

dim(data)
```

```{r}
sapply(data, function(x) length(unique(x)))
```
We've further refined our dataset, eliminating data from 2016 and from eight countries due to their sparse or incomplete data. This ensures a more reliable basis for our analysis

### 2.2.2 NA Values 

Once we've eliminated the incomplete data, we'll proceed to inspect each feature for the presence of null values.

```{r}
na_counts <- sapply(data, function(x) sum(is.na(x))/nrow(data)*100)
na_counts_df <- data.frame(Feature = names(na_counts), NA_ratio = na_counts)
na_counts_df = na_counts_df %>% `rownames<-`( NULL )
print(na_counts_df)
```

Approximately 70% of the 'HDI_for_year' column contains null values, necessitating an adjustment. Despite comprehensive exploration, we were unable to find any reliable data to fill these null values in the HDI. Additionally, since the formula for calculating the HDI changed in 2010, the index before and after this year is not directly comparable. As we've added life expectancy data for each year, the decision has been made to drop the 'HDI_for_year' column.

```{r}
data = subset(data, select = -c(HDI_for_year) )
```
Approximately 7.5% of the temperature data consists of null values which requires addressing. As a first step, we'll identify the countries that contain null values in their temperature data.

```{r}
data %>% 
  group_by(country) %>% 
  filter(all(is.na(min_temp))) %>% 
  pull(country) %>% 
  unique()
```
There's only one country, the Maldives (located in South Asia), for which temperature data is unavailable. Given that we have ample data for Asia, we've made the decision to exclude the Maldives from our dataset.

```{r}
data %>%
  group_by(continent) %>%
  summarise(num_countries = n_distinct(country))
```

```{r}
countries_to_remove <- c("Maldives")

data <- data[!data$country %in% countries_to_remove, ]
```

Given that our temperature data concludes in 2013, we will fill the 'avg_temp', 'min_temp', and 'max_temp' fields for the years 2014 and 2015 using the corresponding data from 2013. 

We observed that the data for Ukraine in 2013 is missing from our suicide dataset. Consequently, to address this absence, we will use the data from 2012 for this particular country to approximate the values for 2014 and 2015.


```{r}
df_2013 <- data %>%
  filter(year == 2013) %>%
  select(country, avg_temp, min_temp, max_temp)

names(df_2013)[2:4] <- paste0(names(df_2013)[2:4], "_2013")

df_2012 <- data %>%
  filter(year == 2012) %>%
  select(country, avg_temp, min_temp, max_temp)

names(df_2012)[2:4] <- paste0(names(df_2012)[2:4], "_2012")

# Replace NA values in 2014 and 2015 using the lookup table
data <- data %>%
  mutate(year = as.character(year)) %>%
  rowwise() %>%
  mutate(
    avg_temp = ifelse(year %in% c("2014", "2015") & is.na(avg_temp) & country != "Ukraine",
                      df_2013$avg_temp_2013[df_2013$country == country],
                      ifelse(year %in% c("2014", "2015") & is.na(avg_temp) & country == "Ukraine",
                             df_2012$avg_temp_2012[df_2012$country == country],
                             avg_temp)),
    min_temp = ifelse(year %in% c("2014", "2015") & is.na(min_temp) & country != "Ukraine",
                      df_2013$min_temp_2013[df_2013$country == country],
                      ifelse(year %in% c("2014", "2015") & is.na(min_temp) & country == "Ukraine",
                             df_2012$min_temp_2012[df_2012$country == country],
                             min_temp)),
    max_temp = ifelse(year %in% c("2014", "2015") & is.na(max_temp) & country != "Ukraine",
                      df_2013$max_temp_2013[df_2013$country == country],
                      ifelse(year %in% c("2014", "2015") & is.na(max_temp) & country == "Ukraine",
                             df_2012$max_temp_2012[df_2012$country == country],
                             max_temp))
  ) %>%
  ungroup()
```

```{r}
data <- data %>%
  mutate(year = as.integer(year))
```

```{r}
dim(data)
```

```{r}
na_counts <- sapply(data, function(x) sum(is.na(x))/nrow(data)*100)
na_counts_df <- data.frame(Feature = names(na_counts), NA_ratio = na_counts)
na_counts_df = na_counts_df %>% `rownames<-`( NULL )
print(filter(na_counts_df, NA_ratio>0))
```

With all missing values effectively handled, our dataset is now clean and ready for further analysis. 


## 2.3 Factorizing Categorical Data

```{r}
# Nominal factors
data_nominal <- c('country', 'sex', 'continent')

data[data_nominal] <- lapply(data[data_nominal], function(x){factor(x)})


# Making age ordinal
data$age <- factor(data$age, 
                   ordered = T, 
                   levels = c("5-14",
                              "15-24", 
                              "25-34", 
                              "35-54", 
                              "55-74", 
                              "75+"))

# Making generation ordinal
data$generation <- factor(data$generation, 
                   ordered = T, 
                   levels = c("G.I. Generation", 
                              "Silent",
                              "Boomers", 
                              "Generation X", 
                              "Millenials", 
                              "Generation Z"))

data <- as_tibble(data)
```


## 2.4 Outliers

Detecting and addressing outliers is a fundamental step in data preprocessing, especially for linear regression models that are significantly influenced by outliers.

Several methods exist to identify outliers, including:

* **Visual Inspection** using Boxplots and Scatterplots: These plots offer a straightforward way to visually identify outliers. Boxplots are particularly useful for univariate analysis, while scatterplots facilitate bivariate analysis.

* **Z-Score Method**: This technique labels any data point that deviates more than three standard deviations from the mean as an outlier. However, this method is only effective when the data is completely or nearly normally distributed. Hence, it's not ideal for skewed data.

* **Tukey's Fences**:It is calculated by creating a “fence” boundary a distance of 1.5 IQR beyond the 1st and 3rd quartiles. Any data beyond these fences are considered to be outliers.This method provides a robust mechanism to spot outliers, even in skewed distributions.

Once outliers are identified, we can employ several strategies to handle them, such as:

* **Rescaling and Transforming Data**: Techniques such as log transformation, square root transformation, or cube root transformation can help lessen the data skewness and mitigate the effects of outliers.

* **Truncation or Winsorization**: This method caps the outliers at a specified percentile of the data, like the 5th or 95th percentile.

* **Removing Outliers**: In extreme cases, when we are confident that an outlier arises from incorrect data entry or measurement, we might decide to eliminate these values to avoid their undue influence on our model. However, this method should be a last resort, as it might result in information loss and should be justified thoroughly.


Let's begin by examining the distribution and range of our features to better understand the spread and dispersion of our data.

```{r}
summary(data)
```

### 2.4.1 Visualize Distribution of the Data

```{r}

# Set the overall layout for the combined plot
par(mfrow = c(3, 4))
par(mar = c(2, 2, 2, 2))  # Adjust the margins for each plot

# for each column in the dataframe

for(col in names(data)) {
  # if it's a numeric column
  if(is.numeric(data[[col]])) {
    # create a histogram
    hist(data[[col]], main=col, xlab=col, col = "#13527a", border = "#ebebeb", cex.main = 1)
  }
}

```

### 2.4.2 BoxPlots

```{r}
# Set the overall layout for the combined plot
par(mfrow = c(2, 4))
par(mar = c(2, 2, 2, 2))  # Adjust the margins for each plot

# for each column in the dataframe

for(col in names(data)) {
  # if it's a numeric column
  if(is.numeric(data[[col]])) {
    # create a histogram
    boxplot(data[[col]], main=col, xlab=col, col = "#ebebeb", border = "#13527a", cex.main = 1)
  }
}

```
From the above boxplots, it's evident that "suicides_no", "population", and "GDP_for_year" all exhibit a significant number of outliers. Additionally, "suicide_ratio" and "GDP_per_capita" also show a substantial number of outlier values.


Our examination of the boxplots and histograms reveals a significant concentration of data within the 'suicide_no' and 'suicide_ratio' parameters, skewed towards zero. This high density around zero manifests as a long tail in the distribution towards the right.
In order to address this skewness, we need to delve deeper into the records where 'suicide_no' is recorded as zero.


```{r}
# Define a common theme
common_theme <- theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.title = element_text(size = 12),
    legend.position = "none",
    panel.grid.major = element_line(color = "grey", linewidth = 0.1),
    panel.grid.minor = element_blank()
  )

# Define a common color
common_color <- "steelblue"

# Data for the first plot
zero_suicides_data <- data %>%
  filter(suicides_no == 0) %>%
  group_by(age) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = age, y = count, fill=age)) +
    geom_bar(stat = "identity") +
    labs(x = "Age", y = "Count", 
         title = "Zero Suicides by Age Group") +
    common_theme

# Data for the second plot
age_plot <- data %>%
  group_by(age) %>%
  summarize(suicide_per_100k = (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000) %>%
  ggplot(aes(x = age, y = suicide_per_100k, fill = age)) + 
  geom_bar(stat = "identity") + 
  labs(title = "Global suicides per 100k, by Age",
       x = "Age", 
       y = "Suicides per 100k") +
  common_theme +
  scale_y_continuous(breaks = seq(0, 30, 1), minor_breaks = NULL) # Changed breaks for better visibility

# Arrange the plots
grid.arrange(age_plot, zero_suicides_data, ncol = 2)

```
Our observations suggest that a significant proportion of zero-suicide instances are within the 5-14 age bracket. The overall suicide average for this age group is notably low, generally under one.

This age group, however, doesn't adequately represent the larger population. The reasons for suicide within this age bracket are likely to be fundamentally different from those of other groups, potentially influenced by unique causes.

Given these considerations, we have opted to exclude the 5-14 age group from our data. This decision stems from the realization that this group exhibits distinctly different behaviors and is not representative of the broader population in the context of suicide rates.

```{r}
data <- data%>%
  filter(age != '5-14')
```

```{r}
dim(data)
```

### 2.4.3 Tukey's Fences

As indicated by the histograms above, most of our data does not adhere to a normal distribution. To yield more specific results, we will initially employ Tukey's Fences method to identify the number of outliers in each feature.

```{r}
#check the number of outliers in each features

# Define the outlier_count function
Tukey_outlier_count <- function(col) {
  q75 <- quantile(col, 0.75, na.rm= TRUE)
  q25 <- quantile(col, 0.25, na.rm= TRUE)
  iqr <- q75 - q25
  min_val <- q25 - (iqr * 1.5)
  max_val <- q75 + (iqr * 1.5)
  outlier_count <- sum(col > max_val | col < min_val)
  outlier_percent <- round(outlier_count / length(col) * 100, 2)
  return(c(outlier_count, outlier_percent))
}

# Get numeric data
numeric_data <- data[, sapply(data, is.numeric)]

# Apply the function to numeric columns
outliers <- sapply(numeric_data, Tukey_outlier_count)

# Convert to dataframe
outliers_df <- as.data.frame(t(outliers))
colnames(outliers_df) <- c("Outlier_Count", "Outlier_Percent")

# Print the result
print(outliers_df)
```

### 2.4.4 Transformations 

The results from the Tukey's method, consistent with our boxplot observations, reveal a significant proportion of outliers in the "suicides_no", "suicide_ratio", "population", and "GDP_for_year" data. In an attempt to reduce the impact of these outliers, we plan to enrich our dataset with additional columns, each representing log-transformed and square root-transformed values of these variables.

However, we face a challenge with the "suicide_no" and "suicide_ratio" variables as they contain zero values, making it impossible to apply a straightforward log transformation. To circumvent this issue, we'll introduce an adjustment factor, a constant c=1, to all suicide numbers. Subsequently, we'll compute a new ratio and apply a log transformation to it. This approach ensures a smooth and successful transformation process.

```{r}
# Add a small constant to avoid undefined log values
c <- 1

data <- data %>%
  mutate(new_suicides_no = suicides_no + c,
         new_suicide_ratio = new_suicides_no / population,
         log_population = log(population),
         log_GDP_year = log(GDP_for_year),
         log_GDP_capita = log(GDP_per_capita),
         log_suicide_no = log(new_suicides_no),
         log_suicide_ratio = log(new_suicide_ratio)
         )
```

In our analysis, we opted for the natural logarithm for its ease of interpretation. While logarithmic transformations with different bases don't alter the distribution's form, they do have implications for how we interpret the coefficients in our model.

With the natural logarithm (base e), coefficients in a model where both the predictor (x) and response (y) variables are log-transformed indicate the percentage change in y corresponding to a 1% change in x.

On the other hand, if a base-10 logarithm were used in the same circumstances, each coefficient would represent the change in y associated with a 10% change in x.

Therefore, by using the natural logarithm, we simplify the interpretation of our model's output, enabling more straightforward conclusions and discussions.

```{r}
data <- data %>%
  mutate(sqrt_population = sqrt(population),
         sqrt_GDP_year = sqrt(GDP_for_year),
         sqrt_GDP_capita = sqrt(GDP_per_capita),
         sqrt_suicide_no = sqrt(suicides_no),
         sqrt_suicide_ratio = sqrt(suicide_ratio)
         )
```


```{r}
# Define the list of columns to be processed
transformed_col = c("population","log_population", "sqrt_population",
                    "GDP_for_year", "log_GDP_year", "sqrt_GDP_year", 
                    "GDP_per_capita", "log_GDP_capita", "sqrt_GDP_capita" , 
                    "suicides_no", "log_suicide_no", "sqrt_suicide_no",
                    "suicide_ratio","log_suicide_ratio", "sqrt_suicide_ratio")

# Apply the function to numeric columns
transformed_data <- data[, transformed_col]
transformed_outliers <- sapply(transformed_data, Tukey_outlier_count)

# Convert to dataframe
outliers_df <- as.data.frame(t(transformed_outliers))
colnames(outliers_df) <- c("Outlier_Count", "Outlier_Percent")

# Print the result
print(outliers_df)
```


```{r}

# Set the overall layout for the combined plot
par(mfrow = c(3, 4))
par(mar = c(2, 2, 2, 2))  # Adjust the margins for each plot

# for each column in the dataframe

for(col in names(data)) {
  # if it's a numeric column
  if(is.numeric(data[[col]])) {
    # create a histogram
    hist(data[[col]], main=col, xlab=col, col = "#13527a", border = "#ebebeb", cex.main = 1)
  }
}
```

Upon applying the log transformation, we noticed a remarkable decrease in the number of outliers. Moreover, the transformed data showed a tendency towards a more normal distribution, indicating the effectiveness of the transformation.

With the data distribution now less skewed and more akin to a normal distribution, we leveraged the Z-score method to further quantify the remaining outliers in each column. This allowed us a more precise examination of the data spread and outlier prevalence.

```{r}
#data = subset(data, select = -c(sqrt_population,
#                                sqrt_GDP_year,
#                                sqrt_GDP_capita,
#                                sqrt_suicide_no,
#                                sqrt_suicide_ratio))
```



### 2.4.5 Z_Score

Next, we employ the Z-Score method to identify potential outliers within each feature. To do this, we'll establish upper and lower bounds, beyond which a data point will be classified as an outlier. The calculations for these boundaries are as follows:

Upper limit: Mean + (3 * Standard Deviation)
Lower limit: Mean - (3 * Standard Deviation)

This method is based on the principle that for a normally distributed dataset, about 99.7% of data falls within three standard deviations from the mean. Hence, any data point beyond this range can be considered an outlier.

```{r}
Z_Score <- function(col){
  return((col - mean(col)) / sd(col))
}

Z_outlier_count <- function(col) {
  Upper_limit = mean(col) + (3 * sd(col))
  Lower_limit = mean(col) - (3 * sd(col))
  outlier_count <- sum(col > Upper_limit| col < Lower_limit)
  outlier_percent <- round(outlier_count / length(col) * 100, 2)
  return(c(outlier_count, outlier_percent))
}

# Get numeric data
numeric_data <- data[, sapply(data, is.numeric)]

# Apply the function to numeric columns
outliers <- sapply(numeric_data, Z_outlier_count)

# Convert to dataframe
outliers_df <- as.data.frame(t(outliers))
colnames(outliers_df) <- c("Outlier_Count", "Outlier_Percent")

# Print the result
print(outliers_df)
```
 



### 2.4.6 Exploring Outliers

In our project, "suicide_ratio" is the key variable we aim to predict using linear regression. It is important to acknowledge that linear regression models are particularly susceptible to the influence of outliers. Therefore, it is essential to adequately address and manage any outliers present in the "suicide_ratio" variable, to ensure our model's accuracy and reliability. 

let's explore outliers in suicide ratio and check if there is any pattern in them. we use tukey's fence dtected outliers.(to be edited)

```{r}
# Calculate IQR and fences
Q1 <- quantile(data$suicide_ratio, 0.25)
Q3 <- quantile(data$suicide_ratio, 0.75)
IQR <- Q3 - Q1

lower_fence <- Q1 - 1.5 * IQR
upper_fence <- Q3 + 1.5 * IQR

# Filter outliers
suicide_outliers <- data %>% 
  filter(suicide_ratio < lower_fence | suicide_ratio > upper_fence)

```

```{r}
summary(data)
```

```{r}
summary(suicide_outliers)
```
The analysis shows that the top six countries with outlier suicide ratios are the Russian Federation, Kazakhstan, Ukraine, Lithuania, Hungary, and Belarus. Intriguingly, these nations are not only geographically proximate, but also share cultural and historical links. This observation may imply potential regional trends or shared socio-economic factors influencing the elevated suicide ratios. 

Furthermore, a striking detail emerges from the outliers: nearly all, or 96.7%, are men. This finding indicates a significantly higher incidence of extreme suicide ratios among men.

```{r}
dim(suicide_outliers)
```


Due to the nature of our data, and the analysis we performed we believe that removing outliers or applying trunication(Winsorization) will cause information loss. so we keep the outliers for EDA.and we will try different methods on modeling the data to see which perform the best on our data. these are the methods we will try:
1. removing detected outliers with both Tukey's Fence and Z_score method. With removing outliers 
However, this can be risky because it assumes that the outliers are not informative and may lead to biased estimates. 

2. Robust Regression Methods: Given the number of outliers and their potential influence on the model, a robust regression method might be a good choice. These methods are less sensitive to outliers and can often provide better predictive performance when outliers are present.

3. Other Machine learning models which are less sensetive to outliers. (to be edited )


# 3. Explore Data

In this forthcoming section, we dive into the exploration of our dataset, distinguishing variables into four distinct categories.

Firstly, we have 'year' which falls under **time-dependent variables**, mapping the temporal evolution of our data.

Secondly, we have a set of **geographical and meteorological variables**. These include 'continent', 'country', 'population', and a range of temperature parameters (minimum, maximum, average) alongside their transformations, offering us insights into regional and environmental influences.

Our third category brings together **social and economic variables** such as 'life expectancy', 'GDP', and 'GDP per capita'. These, along with their respective transformations, capture the socio-economic backdrop against which we observe our data.

Lastly, our fourth category comprises **demographic variables**, namely 'sex' and 'age', allowing us to examine the influence of these vital demographics on our data.

Moreover, we have identified three potential target variables for our study: 'suicide_ratio', 'log_suicide_ratio', and 'sqrt_suicide_ratio'. Of these, 'log_suicide_ratio' has been found to be highly effective in minimizing the impact of outliers. Yet, our exploration won't be limited to it. We aim to thoroughly investigate the impact of all variables on each potential target until we embark on the modeling phase, where we will select the most suitable target variable for our predictive model.


```{r}
glimpse(data)
```

```{r}
column_name <- colnames(data)
```

## 3.1  Time-Dependent 


```{r}
# the global rate over the time period will be useful:
global_average <- (sum(as.numeric(data$suicides_no)) / sum(as.numeric(data$population))) * 100000

data %>%
  group_by(year) %>%
  summarize(population = sum(population), 
            suicides = sum(suicides_no), 
            suicides_per_100k = (suicides / population) * 100000) %>%
  ggplot(aes(x = year, y = suicides_per_100k)) + 
  geom_line(col = "red", linewidth = 1) + 
  geom_point(col = "red", size = 2) + 
  geom_hline(yintercept = global_average, linetype = 2, color = "grey35", linewidth = 1) +
  labs(title = "Global Suicides (per 100k)",
       subtitle = "Trend over time, 1985 - 2015.",
       x = "Year", 
       y = "Suicides per 100k") + 
  scale_x_continuous(breaks = seq(1985, 2015, 2)) + 
  scale_y_continuous(breaks = seq(10, 20))
```
The plot above yields several insightful observations:

* The highest suicide rate recorded was 18.7 deaths per 100k population, observed in 1995.
* This rate has seen a consistent decrease, falling to 13.5 per 100k population by 2015, which translates to a significant reduction of about 27%.
* Presently, the rates are gradually regressing towards the figures prevalent prior to the 1990s.

However, a crucial aspect to remember is that the data available from the 1980s is relatively scarce, thus making it difficult to conclusively state whether these rates were an accurate reflection of the global suicide trends during that period.


### 3.1.1 Why did people killed themselves in 1995?

```{r}
data_95 <- data %>%
  filter(year == 1995) %>%
  group_by(country) %>%
  summarize(population = sum(population), 
            suicides = sum(suicides_no), 
            suicides_per_100k = (suicides / population) * 100000)

data_95 <- data_95 %>%
  arrange(desc(suicides_per_100k))

head(data_95)
```

During our exploration of outliers, we observed that Eastern European countries have significantly higher suicide rates compared to other nations. In particular, it appears that a substantial number of suicides in 1995 were reported in Russia.

```{r}
data_without_ru <- data %>%
  filter(country != "Russian Federation") %>%
  group_by(year) %>%
  summarize(population = sum(population), 
            suicides = sum(suicides_no), 
            suicides_per_100k = (suicides / population) * 100000)


yearly_data <- data %>%
  group_by(year) %>%
  summarize(population = sum(population), 
            suicides = sum(suicides_no), 
            suicides_per_100k = (suicides / population) * 100000)

data_without_ru <- data_without_ru %>%
  mutate(inclusion = "Without Russia")

yearly_data <- yearly_data %>%
  mutate(inclusion = "With Russia")

combined_data <- bind_rows(data_without_ru, yearly_data)

ggplot(combined_data, aes(x = year, y = suicides_per_100k, color = inclusion)) +
  geom_line() +
  labs(
    title = 'Number of Suicides for every 100k people', 
    x = 'Year', 
    y = 'Suicide Ratio per 100k',
    color = "Country Inclusion"
  ) +
  theme(legend.position = "right") +
  scale_color_manual(values = c("Without Russia" = "blue", "With Russia" = "red"))
```
The significant increase in suicide rates observed in Russia in 1995 can be attributed to a combination of several factors, each contributing to an overall sense of despair and instability within the population:

**Economic Crisis**: The period marked Russia's challenging transition from a centrally planned economy to a free-market system. This shift resulted in substantial economic turmoil characterized by high unemployment rates, rampant inflation, and general economic uncertainty. As numerous studies have indicated, such economic hardships can greatly increase stress levels within the population, thereby leading to higher suicide rates.

**Political Instability**: The dissolution of the Soviet Union in 1991 precipitated a series of radical political and societal changes. The ensuing uncertainty and the resultant feeling of insecurity may have exacerbated the already volatile situation, thereby contributing to the rise in suicide rates.

**Rise in Alcoholism**: During this period, Russia experienced an increase in alcohol abuse, a problem that has historically been a challenge for the country. A well-established body of research shows a strong correlation between alcohol abuse and suicide rates. The spike in alcoholism during this time might have, therefore, been a significant contributor to the suicide rates [links](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1642767/).

It is essential to note that while these factors are distinct, they are interrelated and likely exacerbated each other's effects on the population's mental health. 




## 3.2 Geographical and Meteorological




### 3.2.1 Population

#### 3.2.1.1 Univariate Analysis

```{r}
data$scaled_population <- (data$population - min(data$population))/(max(data$population)-min(data$population))
data$scaled_log_population <- (data$log_population - min(data$log_population)) / (max(data$log_population) - min(data$log_population))
```


```{r}
library(gridExtra)

# Original Population
p1 <- ggplot(data, aes(population)) +
  geom_histogram(binwidth=1000, fill="skyblue", color="black") +
  labs(title = "Histogram of Population",
       x = "",
       y = "Count")

p2 <- ggplot(data, aes(x = "", y = population)) +
  geom_boxplot(fill="lightgreen", color="black") +
  labs(title = "Box Plot of Population",
       x = "",
       y = "")

# Scaled Population
p3 <- ggplot(data, aes(scaled_population)) +
  geom_histogram(binwidth=0.01, fill="skyblue", color="black") +
  labs(title = "Histogram of Scaled Population",
       x = "",
       y = "Count")

p4 <- ggplot(data, aes(x = "", y = scaled_population)) +
  geom_boxplot(fill="lightgreen", color="black") +
  labs(title = "Box Plot of Scaled Population",
       x = "",
       y = "")

# Scaled Log Population
p5 <- ggplot(data, aes(scaled_log_population)) +
  geom_histogram(binwidth=0.01, fill="skyblue", color="black") +
  labs(title = "Histogram of Scaled Log Population",
       x = "",
       y = "Count")

p6 <- ggplot(data, aes(x = "", y = scaled_log_population)) +
  geom_boxplot(fill="lightgreen", color="black") +
  labs(title = "Box Plot of Scaled Log Population",
       x = "",
       y = "")

# Arrange the plots in a grid
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 2)

```
From these visualizations, we can observe that the distribution of the population variable becomes significantly less skewed after applying both min-max scaling and a log transformation. The resulting scaled log population appears more normally distributed and shows fewer outliers compared to the original and scaled population. Therefore, using the 'scaled_log_population' variable in our analyses should yield more robust results.


#### 3.2.1.2 Bivariate Analysis

In this section, we investigate the correlation between our candidate target variables and scaled_log_population. Since our suicide_ratio variable was derived from the ratio of suicide_no to population, our analysis focuses on the relationship between scaled_log_population and the three transformations of suicide_no: namely, suicide_no, log_suicide_no, and sqrt_suicide_no.

```{r}
data_long <- data %>%
  select(scaled_log_population, suicides_no, log_suicide_no, sqrt_suicide_no) %>%
  pivot_longer(cols = -scaled_log_population, names_to = "variable", values_to = "value")

ggplot(data_long, aes(x = scaled_log_population, y = value, color= "skyblue")) +
  geom_point(size=0.2) +
  facet_wrap(~variable, scales = "free_y") +
  labs(x = "Scaled Log Population",
       y = "Value",
       title = "Scatterplot of Scaled Log Population vs. Target Variables") + 
  theme(legend.position = "none") 

```
From these scatterplots, it appears that the relationship between scaled_log_population and log_suicide_no is more linear compared to the other variables. 


```{r}
# List of variables to calculate correlations with scaled_log_population
vars <- c("suicides_no", "log_suicide_no", "sqrt_suicide_no")

# Calculate correlations
correlations <- purrr::map_dbl(vars, ~cor(data$scaled_log_population, data[[.]]))

# Print correlations
names(correlations) <- vars
print(correlations)
```

Is it surprising that there is a high correlation between the transformation of population and suicide_no? Not at all!

Our goal is to estimate the probability of an individual taking their own life, which we calculate by dividing the suicide_no by the population in each row. However, this means we cannot directly examine the relationship between suicide_ratio and population, as we use the population directly to calculate the ratio. Instead, we need to analyze the relationship between population and suicide_no for each row.

When the population of each row increases, the probability of a higher suicide_no also tends to increase. However, this comparison may not be entirely accurate. Should we disregard population altogether? Before making that decision, let's address this question: Is there a relationship between a country's population and suicide_ratio? In other words, is the probability of an individual taking their own life higher in countries with a larger population?

To explore this, we have stratified the population of each country into three categories (big, medium, and small) for each year. This allows us to examine whether there is a notable difference in suicide_ratio based on the size of the population.

Before moving forward, let's create a dataframe that calculates the population of each country over the years.
```{r}
pop_data <- data %>%
  group_by(year, country) %>%
  summarize(population = sum(population, na.rm = TRUE), .groups = "drop")
```

We aim to establish thresholds for each year to classify countries into four categories: very small, small, medium, and large. It is important to identify natural gaps in the distribution of population for each country within a given year. It should be noted that a median-based approach is not suitable since it would result in three categories of equal size. Additionally, we need to compare the population of each country with the populations of other countries within the same year, as a country may not be densely populated at present but could experience a significant increase in population in the future.

To accomplish this, we employ the Jenks natural breaks classification method for each year. This method aims to minimize the variance within classes while maximizing the variance between classes. It involves an iterative process that reallocates observations from one class to another until an optimal arrangement is achieved.

```{r}
# Load required library
library(classInt)

# Define a function to apply Fisher-Jenks method for binning
fisher_jenks <- function(x) {
  bins <- classIntervals(x, n = 5, style = "fisher")$brks
  cut(x, breaks = bins, labels = c("Very_Small","Small", "Medium", "Large","Very_Large" ), include.lowest = TRUE)
}

# Add the new column to the data frame
pop_data <- pop_data %>%
  group_by(year) %>%
  mutate(population_bine_jenks = fisher_jenks(population))

# View the first few rows of the new data
head(pop_data)
```
Now, let's examine the population categories we've created:

```{r}
# Tabulate the categories
pop_distribution <- table(pop_data$population_bine_jenks)
print(pop_distribution)
```
From the output, we notice that the category "Very Large" consists of only one country, the USA. To have more balanced categories, we'll combine the USA with the "Large" category and rename accordingly:


```{r}
# Adjust the categories
pop_data <- pop_data %>%
  mutate(population_bine_jenks = ifelse(population_bine_jenks == "Very_Large", "Large",
                                        as.character(population_bine_jenks)))

# Check the distribution after adjustment
adjusted_pop_distribution <- table(pop_data$population_bine_jenks)
print(adjusted_pop_distribution)
```

```{r}
pop_data
```



```{r}
# Join the population categories into our original data
data<- data %>%
  left_join(pop_data%>%
  select(country,year,population_bine_jenks),by = c("year", "country"))
```
We also bin the population based on the median and compare its performance with the Jenks natural breaks classification method in our model.
```{r}
data_sum <- data %>%
  group_by(country, year) %>%
  summarise(population = sum(population))

thresholds <- quantile(data_sum$population, probs = c(0.25, 0.5, 0.75))

# Assign each country-year pair to a population category
data_binned <- data_sum %>%
  mutate(
    population_bine_median = case_when(
      population <= thresholds[1] ~ "Very_Small",
      population > thresholds[1] & population <= thresholds[2] ~ "Small",
      population > thresholds[2] & population <= thresholds[3] ~ "Medium",
      TRUE ~ "Large"
    )
  )

data <- data %>%
  left_join(data_binned%>%
  select(country,year,population_bine_median),by = c("year", "country"))
```
In this step, we will explore the relationship between log_suicide_ratio (chosen because it follows a normal distribution) and population_binde_jenks.

Let's start by examining the descriptive statistics for both log_suicide_ratio and suicide_ratio.

```{r}
data %>%
  group_by(population_bine_jenks) %>%
  summarise(
    mean_suicide_ratio = mean(log_suicide_ratio, na.rm = TRUE),
    median_suicide_ratio = median(log_suicide_ratio, na.rm = TRUE),
    min_suicide_ratio = min(log_suicide_ratio, na.rm = TRUE),
    max_suicide_ratio = max(log_suicide_ratio, na.rm = TRUE)
  )
```
```{r}
data_grouped <- data %>%
  group_by(population_bine_jenks) %>%
  summarise(
    suicide_ratio = sum(suicides_no) / (sum(population) )
  )

ggplot(data_grouped, aes(x = population_bine_jenks, y = suicide_ratio, fill = population_bine_jenks)) +
  geom_bar(stat = "identity", color = "black") +
  labs(x = "Population Bin", y = "Suicide Ratio", title = "Suicide Ratio for each Population Bin") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none") +
  scale_fill_brewer(palette = "Set2")
```
The above plot suggests that the suicide ratio in Large countires is much higher. Now we use statistical tests to see if this difference is statisticaly siginifticant. Since we have more than two samples we can use ANOVA.
It's important to keep in mind that ANOVA makes several assumptions, including that the residuals are normally distributed, that the variances are equal across groups, and that the observations are independent. We should check these assumptions before interpreting the results.


```{r}
# Fit the model
pop_anova <- aov(log_suicide_ratio ~ population_bine_jenks, data = data)

# Run the ANOVA
anova_result <- anova(pop_anova)

# Print the result
print(anova_result)
```

Testing Normality of Residuals Assumption for ANOVA:
```{r}
# Create a data frame for residuals
residuals_df <- data.frame(residuals = residuals(pop_anova))

# Create histogram of residuals
hist_plot <- ggplot(residuals_df, aes(x = residuals)) +
  geom_histogram(fill = 'steelblue', color = 'black', bins = 30) +
  theme_minimal() +
  labs(x = "Residuals", y = "Frequency",
       title = "Histogram of Residuals")

# Create Q-Q plot of residuals
qq_plot <- ggplot(residuals_df, aes(sample = residuals)) +
  geom_qq(color = 'steelblue') +
  geom_qq_line(color = 'red') +
  theme_minimal() +
  labs(title = "Normal Q-Q Plot",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles")

# Arrange the plots side by side using the gridExtra package
library(gridExtra)
grid.arrange(hist_plot, qq_plot, ncol = 2)
```

Testing Homogeneity of Variances Assumption for ANOVA:
```{r}
leveneTest(log_suicide_ratio ~ age, data = data)
```
```{r}
bartlett.test(log_suicide_ratio ~ age, data = data)
```
The small p-values in our tests lead us to reject the null hypothesis, which posits that the variances are equivalent across different groups. This implies that the results from an ANOVA test may not be reliable in our case. Therefore, we turn to the Kruskal-Wallis test.

The Kruskal-Wallis test is a non-parametric technique that assesses whether samples originate from the same distribution. It is suitable for comparing two or more independent samples of equal or differing sample sizes, effectively extending the Mann-Whitney U test, which is utilized for comparing only two groups.

Unlike the one-way ANOVA and t-tests, the Kruskal-Wallis test does not require the residuals to be normally distributed. Consequently, it can be used with continuous data that doesn't follow a normal distribution. However, akin to ANOVA, it evaluates whether the mean ranks of the groups are different, not the means themselves.

The assumptions for the Kruskal-Wallis test are:

Independence: Observations within and between each sample should be independent, implying that one observation's presence or absence doesn't influence another observation's presence or absence.

Ordinal Data: The dependent variable must be ordinal at a minimum, meaning that the observations can be ordered. It should be possible to say that one observation is greater than, equal to, or less than another observation.

Shape of Distribution: Although the Kruskal-Wallis test doesn't require a specific data distribution like ANOVA does, it assumes that the shape of the distribution is identical for each group. While groups may have differing medians, their distribution's overall shape should be the same.

For our data, we assume independence. Our 'population_bin' variable is ordinal, thereby fulfilling the second assumption. To verify the third assumption, we need to evaluate the distribution shapes via plots.

```{r}
# Histogram
ggplot(data, aes(x =log_suicide_ratio)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~ population_bine_jenks)

# Density plot
ggplot(data, aes(x = log_suicide_ratio)) +
  geom_density() +
  facet_wrap(~ population_bine_jenks)

# Q-Q plot
ggplot(data, aes(sample = log_suicide_ratio)) +
  stat_qq() +
  facet_wrap(~ population_bine_jenks)
```

```{r}
kruskal.test(log_suicide_ratio ~ population_bine_jenks, data = data)
```
The result of the Kruskal-Wallis test aligns with that of the ANOVA. The obtained p-value is significantly small, indicating that the suicide rate differs significantly among different population categories.


```{r}
#Factorizing our two new column population_bine_jenks and population_bine_median

data$population_bine_jenks <- factor(data$population_bine_jenks, 
                   ordered = T, 
                   levels = c("Very_Small",
                              "Small",
                              "Medium",
                              "Large"))
data$population_bine_median <- factor(data$population_bine_median, 
                   ordered = T, 
                   levels = c("Very_Small",
                              "Small",
                              "Medium",
                              "Large"))
```


### 3.2.2 Country and Continent

#### 3.2.2.1 Univariate Analysis

```{r}
# Count the number of observations for each country
country_counts <- data %>% 
  group_by(country) %>% 
  summarise(count = n()) %>% 
  arrange(desc(count))

# Create the bar plot
ggplot(country_counts, aes(x = reorder(country, count), y = count)) +
  geom_bar(stat = "identity", fill = "steelblue", color = "black") +
  labs(x = "Country", y = "Number of Rows", title = "Number of Rows for Each Country") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 15), 
        axis.title.y = element_text(size = 10),
        axis.title.x = element_text(size = 10),
        axis.text.x = element_blank()) # Remove country names
```
```{r}
# Join the data to a map
map_data <- joinCountryData2Map(country_counts, joinCode = "NAME", nameJoinColumn = "country")

# Set margins
par(mar = c(0, 0, 1, 0))

# Plot the map
mapCountryData(map_data, 
               nameColumnToPlot = "count", 
               mapTitle = "Number of Rows for each Country", 
               colourPalette = "blues",  # change color palette here
               oceanCol = "lightblue", 
               missingCountryCol = "grey65", 
               catMethod = "pretty")
```

```{r}
# Get the unique countries per continent in the data
continent_count <- data %>%
  group_by(continent) %>%
  summarise(num_countries_data = n_distinct(country))

# Actual country count per continent
actual_count <- data.frame(
  continent = c("Africa", "Asia", "Europe", "Americas", "Oceania"),
  num_countries_actual = c(54, 48, 44, 35, 14)  # replace with actual numbers
)

# Merge the two data frames
continent_count <- merge(continent_count, actual_count, by = "continent")

# Convert to long format for plotting
continent_count_long <- continent_count %>%
  pivot_longer(cols = -continent, 
               names_to = "category", 
               values_to = "count")

# Create a bar plot
# Then, create a bar plot
ggplot(continent_count_long, aes(x = continent, y = count, fill = category)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Continent", y = "Number of Countries",
       title = "Actual Countries vs Countries in Data per Continent") +
  scale_fill_discrete(name = "", labels = c("Number of Countries in Our Data", "Actual Number of Countries")) +
  theme_minimal()
```
The presented plots reveal that we have a limited number of samples from Africa, which should be taken into consideration when interpreting the results for this region. Additionally, it is important to note that our data for Asia is also relatively sparse, which may impact the robustness of any conclusions drawn for this continent.

#### 3.2.2.2 Bivariate Analysis

```{r}
country <- data %>%
  group_by(country, continent) %>%
  summarize(n = n(), 
            suicide_per_100k = (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000,
            .groups="drop") %>%
  arrange(desc(suicide_per_100k))

country$country <- factor(country$country, 
                          ordered = T, 
                          levels = rev(country$country))

ggplot(country, aes(x = country, y = suicide_per_100k, fill = continent)) + 
  geom_bar(stat = "identity") + 
  geom_hline(yintercept = global_average, linetype = 2, color = "grey35", size = 1) +
  labs(title = "Global suicides per 100k, by Country",
       x = "Country", 
       y = "Suicides per 100k", 
       fill = "Continent") +
  #coord_flip() +
  scale_y_continuous(breaks = seq(0, 45, 2)) + 
  theme(
  legend.position = "top",
  legend.key.size = unit(0.25, "cm"),
  plot.title = element_text(hjust = 0.5),
  axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 1, size= 4),
  #axis.line.x = element_line(inherit.blank = TRUE)
  ) 
```
Lithuania’s rate has been highest by a large margin: > 44 suicides per 100k (per year)

```{r}
country <- data %>%
  group_by(country) %>%
  summarize(suicide_per_100k = (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000,
            .groups = "drop")

countrydata <- joinCountryData2Map(country, joinCode = "NAME", nameJoinColumn = "country")

par(mar=c(0, 0, 2, 0)) # margins

mapCountryData(countrydata, 
nameColumnToPlot="suicide_per_100k", 
mapTitle="Sucide per 100k across the Globe", 
colourPalette = "heat", 
oceanCol="lightblue", 
missingCountryCol="grey65", 
catMethod = "pretty")
```
```{r}
mapCountryData(countrydata, 
nameColumnToPlot="suicide_per_100k", 
mapTitle="Suicides per 100k in Eurasia", 
mapRegion = "eurasia", 
colourPalette = "heat", 
oceanCol="lightblue", 
missingCountryCol="grey65", 
addLegend = FALSE, 
catMethod = "pretty")
```
It's essential to be aware of our data's limitations. Specifically, we're missing a significant amount of information for Africa and Asia. On top of that, eight countries were excluded due to lack of sufficient data.

Therefore, our analyses, whether on a global or continent level, might not provide a fully accurate picture. We're essentially trying to make sense of a puzzle with missing pieces.

Lastly, when comparing suicide rates between different countries, it's crucial to consider that what is recorded as a suicide can vary by country. The reliability of suicide reporting can also influence our results.

So, even though our analysis can help identify some trends, we must keep these caveats in mind when interpreting our findings.

Due to the limited availability of data from Africa, we have excluded this continent from the current analysis.
```{r}

data_continent <- data %>% filter(continent != "Africa")

# Function to calculate suicide rate per 100k
calculate_suicide_rate <- function(suicides_no, population) {
  (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000
}

# Calculating suicide rates by continent
continent <-data_continent %>%
  group_by(continent) %>%
  summarize(suicide_per_100k = calculate_suicide_rate(suicides_no, population)) %>%
  arrange(suicide_per_100k)


continent_plot <- ggplot(continent, aes(x = continent, y = suicide_per_100k, fill = continent)) + 
  geom_bar(stat = "identity") + 
  labs(title = "Suicide Rates per 100k by Continent",
       x = "Continent", 
       y = "Suicides per 100k", 
       fill = "Continent") +
  theme(legend.position = "none") + 
  scale_y_continuous(breaks = seq(0, 20, 1), minor_breaks = F)
print(continent_plot)
```
Our preliminary analysis suggests that Europe appears to have a higher suicide rate compared to other continents. To further substantiate this observation, we should validate it with rigorous statistical tests. After all, our dataset is merely a sample and doesn't necessarily represent the whole population.

Hence, we need to investigate if the observed differences are statistically significant.

Our null hypothesis (H0) states that the mean suicide rates are identical across all continents over the 30-year span.

On the other hand, our alternative hypothesis (H1) asserts that at least one continent has a distinct mean suicide rate compared to the others.

We set our significance level at 0.05 for these tests.

For comparison, we'll initially employ the ANOVA test. If our data does not satisfy the assumptions required for ANOVA, we'll resort to the non-parametric Kruskal-Wallis test. 

```{r}
continent_anova <- aov(log_suicide_ratio ~ continent, data = data_continent)

# Run the ANOVA
anova_result <- anova(continent_anova)

# Print the result
print(anova_result)
```
```{r}
kruskal.test(log_suicide_ratio ~ continent, data = data_continent)
```
The p-values derived from both tests are significantly small, leading us to reject the null hypothesis. Thus, we can infer that at least one continent has a distinctive suicide rate. However, the ANOVA test does not provide insights about which specific continent diverges, nor does it indicate the extent of this difference.

To delve into these specifics, we'll utilize the Tukey's Honest Significant Difference (HSD) post-hoc test. This test will facilitate the identification of groups with significantly different means, providing a comprehensive understanding of our data.

```{r}
TukeyHSD(continent_anova)
```
Our initial hypothesis stands correct as the data reveals that Europe exhibits a higher suicide rate compared to other regions.
Now, let's explore if different regions within Europe exhibit distinct patterns in terms of suicide rates.

```{r}
# Define a list of countries for each region
data_europe <- data %>%
  filter(continent == "Europe")

northern <- c("Denmark", "Estonia", "Finland", "Iceland", "Ireland", "Latvia", "Lithuania", "Norway", "Sweden", "United Kingdom","Luxembourg")
southern <- c("Greece", "Italy", "Portugal", "Spain","Malta")
eastern <- c("Bulgaria", "Czech Republic", "Hungary", "Poland", "Romania", "Russian Federation", "Slovakia","Albania","Ukraine","Belarus","Montenegro","Croatia", "Serbia", "Slovenia")
western <- c("Austria", "Belgium", "France", "Germany", "Netherlands", "Switzerland")


# Add a new column 'region' based on the country
data_europe$region <- case_when(
  data_europe$country %in% northern ~ "Northern",
  data_europe$country %in% southern ~ "Southern",
  data_europe$country %in% eastern ~ "Eastern",
  data_europe$country %in% western ~ "Western",
  TRUE ~ NA_character_ # for countries not listed above
)

```


```{r}
# Group the data by region and calculate the average suicide rate per 100k and the number of unique countries
region_data <- data_europe %>%
  group_by(region) %>%
  summarise(avg_suicide_rate_per100k = sum(suicides_no, na.rm = TRUE) / sum(population) * 1e5,
            num_countries = n_distinct(country)) 

# Create the bar plot
ggplot(region_data, aes(x = reorder(region, -avg_suicide_rate_per100k), y = avg_suicide_rate_per100k, fill = region)) +
  geom_col() +
  scale_fill_brewer(palette = "Spectral") +
  labs(x = "Region", y = "Average suicide rate per 100k", 
       title = "Average Suicide Rates per 100k for Regions in Europe") +
  theme_minimal() +
  theme(legend.position = "none")
```
The presented plot indicates that the suicide ratio in Eastern Europe is notably higher compared to other regions. This finding aligns with the observations made in the previous sections regarding the countries within this region.


#### 3.2.2.3 Multivariate Analyisis

Our goal is to understand the temporal trends in suicide rates for each country. Rather than creating visualizations for all 93 countries, we adopt a more streamlined approach by fitting a linear regression model to the data for each country. This allows us to identify patterns of increase or decrease in suicide rates over time.

Specifically, we're interested in the 'year' coefficient in our linear models. This coefficient signifies the rate of change in suicide rates over time. To control for multiple comparisons, we only consider those countries with a corrected p-value less than 0.05.

To summarize, we are identifying countries where there's a statistically significant linear trend in suicide rates over time. These trends are then rank-ordered based on their rate of change, providing a clear picture of where suicide rates are increasing or decreasing most rapidly.

```{r}
# Create a summary data frame, grouping by country and year
country_year <- data %>%
  group_by(country, year) %>%
  summarize(suicides = sum(suicides_no), 
            population = sum(population), 
            suicide_per_100k = (suicides / population) * 100000, 
            gdp_per_capita = mean(GDP_per_capita),
            .groups = "drop")  # Prevents the warning about groups

# Fit a linear model for each country, tidy the output, and filter for significant trends
country_year_trends <- country_year %>%
  nest(data = -country) %>%  # Use explicit naming to prevent warning
  mutate(model = map(data, ~ lm(suicide_per_100k ~ year, data = .)),
         tidied = map(model, broom::tidy)) %>%
  unnest(cols = c(tidied))

# Adjust p-values and filter for significant results
country_year_sig_trends <- country_year_trends %>%
  filter(term == "year") %>%
  mutate(p.adjusted = p.adjust(p.value, method = "holm")) %>%
  filter(p.adjusted < .05) %>%
  arrange(estimate)

# Make country an ordered factor
country_year_sig_trends <- mutate(country_year_sig_trends, country = factor(country, ordered = TRUE, levels = country))
```

```{r}
ggplot(country_year_sig_trends, aes(x=country, y=estimate, col = estimate)) + 
  geom_point(stat='identity', size = 2) +
  geom_hline(yintercept = 0, col = "grey", size = 1) +
  scale_color_gradient(low = "green", high = "red") +
  geom_segment(aes(y = 0, 
                   x = country, 
                   yend = estimate, 
                   xend = country), size = 1) +
  labs(title="Change per year (Suicides per 100k)", 
       x = "Country", y = "Change Per Year (Suicides per 100k)") +
  scale_y_continuous(breaks = seq(-2, 2, 0.2), limits = c(-1.5, 1.5)) +
  theme(legend.position = "none",
        axis.text.y = element_text(size= 5)) +
  coord_flip()

```
Approximately half of the countries (48 out of 96) exhibit a linear change in suicide rates as time progresses. Among these 48 countries, 32 of them (about two-thirds) show a decreasing trend. Overall, this trend presents a positive picture. However, it is worth noting that the suicide rates in Guyana and Korea are a cause for concern as they display concerning patterns.



```{r}
### Lets look at those countries with the steepest increasing trends

top12_increasing <- tail(country_year_sig_trends$country, 12)

country_year %>%
  filter(country %in% top12_increasing) %>%
  ggplot(aes(x = year, y = suicide_per_100k, col = country)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  facet_wrap(~ country) + 
  theme(legend.position = "none") + 
  labs(title="12 Steepest Increasing Trends", 
       subtitle="Of countries with significant trends (p < 0.05)", 
       x = "Year", 
       y = "Suicides per 100k")
```
The historical data for Guyana raises concerns due to a seemingly improbable jump in the suicide rate. While Guyana is known for having high suicide rates, the sudden increase observed appears questionable. It is possible that changes in how suicide cases were classified or reported could have influenced this significant surge in the dat

```{r}
continent_time <- data_continent %>%
  group_by(year, continent) %>%
  summarize(suicide_per_100k = (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000, .groups="drop")

continent_time$continent <- factor(continent_time$continent, ordered = T, levels = continent$continent)

continent_time_plot <- ggplot(continent_time, aes(x = year, y = suicide_per_100k, col = factor(continent))) + 
  facet_grid(continent ~ ., scales = "free_y") + 
  geom_line() + 
  geom_point() + 
  labs(title = "Trends Over Time, by Continent", 
       x = "Year", 
       y = "Suicides per 100k", 
       color = "Continent") + 
  theme(legend.position = "none", title = element_text(size = 10)) + 
  scale_x_continuous(breaks = seq(1985, 2015, 5), minor_breaks = F)
print(continent_time_plot)
```

* Europe, having the highest suicide rate overall, has experienced a consistent decline of approximately 40% since 1995.
* By 2015, Europe's suicide rate had converged with that of Asia and Oceania, showing similar levels.
* In contrast to the global downward trend, Oceania and the Americas demonstrate an upward trajectory in suicide rates.
This distinct pattern is concerning and calls for a thorough investigation into the underlying factors contributing to this rise, as well as the implementation of effective interventions.

### 3.2.3 Tempurate
We have three variables: avg_temp, max_temp, and min_temp. These variables represent the average, maximum, and minimum temperatures, respectively, for each country in each year.


#### 3.2.3.1 Univariate Analysis

```{r}
library(gridExtra)
# Function to create a list of plots for each variable
create_plots <- function(variable_name, data) {
  p1 <- ggplot(data, aes_string(variable_name)) +
    geom_boxplot() 

  p2 <- ggplot(data, aes_string(variable_name)) +
    geom_histogram(bins = 30, fill = "steelblue", color = "white") 

  p3 <- ggplot(data, aes_string(variable_name)) +
    geom_density(fill = "steelblue")

  list(p1, p2, p3)
}

# Create plots for each temperature variable
avg_temp_plots <- create_plots("avg_temp", data)
min_temp_plots <- create_plots("min_temp", data)
max_temp_plots <- create_plots("max_temp", data)

# Arrange the plots in a grid
grid.arrange(grobs = c(avg_temp_plots, min_temp_plots, max_temp_plots), ncol = 3)

```


#### 3.2.3.2 Bivariate Analysis

Before calculating the correlation between temperature variables and suicide ratio, it is important to remove extreme outliers from the dataset.


```{r}
library(gridExtra)
library(knitr)

# Function to process each temperature variable
correlation_plots <- function(temp_var, data) {
  
  # Calculate and print the correlation before removing outliers
  corr_before <- cor(data[[temp_var]], data$suicide_ratio, use = "complete.obs")
  
  # Calculate quartiles and IQR
  Q1 <- quantile(data[[temp_var]], 0.25, na.rm = TRUE)
  Q3 <- quantile(data[[temp_var]], 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  
  # Identify outliers
  outliers <- data[[temp_var]] < (Q1 - 1.5 * IQR) | data[[temp_var]] > (Q3 + 1.5 * IQR)
  
  # Remove outliers
  data_no_outliers <- data[!outliers, ]
  
  # Calculate and print the correlation after removing outliers
  corr_after <- cor(data_no_outliers[[temp_var]], data_no_outliers$suicide_ratio, use = "complete.obs")
  
  # Create scatter plot with a regression line
  plot <- ggplot(data_no_outliers, aes_string(x = temp_var, y = "suicide_ratio")) +
    geom_point(color="skyblue", size=0.5) +
    geom_smooth(method = lm, color = "pink") +
    labs(x = paste("Temperature (", temp_var, ")", sep = ""), y = "Suicide Ratio", 
         title = temp_var) +
    theme_minimal()
  
  return(list(corr_before = corr_before, corr_after = corr_after, plot = plot))
}

# Call the function for each temperature variable
result_avg_temp <- correlation_plots("avg_temp", data)
result_min_temp <- correlation_plots("min_temp", data)
result_max_temp <- correlation_plots("max_temp", data)

# Combine the plots into a grid
grid.arrange(result_avg_temp$plot, result_min_temp$plot, result_max_temp$plot, nrow=1, ncol=3)

```

```{r}
# Combine the correlation results into a data frame and display as a table
correlation_results <- data.frame(
  Temperature_Variable = c("avg_temp", "min_temp", "max_temp"),
  Correlation_Before = c(result_avg_temp$corr_before, result_min_temp$corr_before, result_max_temp$corr_before),
  Correlation_After = c(result_avg_temp$corr_after, result_min_temp$corr_after, result_max_temp$corr_after)
)

print(correlation_results)
```
```{r}
cor(data$min_temp,data$sqrt_suicide_ratio)
```

```{r}
cor(data$min_temp,data$log_suicide_ratio)
```
It appears that the temperature variables exhibit a stronger correlation with the square root of the suicide ratio (sqrt_suicide_ratio). This observation is noteworthy and should be taken into account during further analysis.


Now, similar to the population variable, let's bin the temperature variables. It is important to note that when using the Jenks method, we classify the temperatures for each year separately. This approach helps reduce the influence of temperature variations over the years and provides a more accurate assessment of the temperature categories.


```{r}
temp_data <- data %>%
  group_by(year, country) %>%
  summarize(temp = mean(avg_temp), .groups = "drop")

```

```{r}
fisher_jenks <- function(x) {
  bins <- classIntervals(x, n = 5, style = "fisher")$brks
  cut(x, breaks = bins, labels = c("Freezing","Very_Cold", "Cold", "Warm","Hot" ), include.lowest = TRUE)
}

# Add the new column to the data frame
temp_data <- temp_data %>%
  group_by(year) %>%
  mutate(avg_temp_bine_jenks = fisher_jenks(temp))

# View the new data
head(temp_data)
```

```{r}
temp_data
```

```{r}
table(temp_data$avg_temp_bine_jenks)
```


Considering the limited number of countries falling under the "Freezing" temperature category, it would be appropriate to combine the "Very Cold" and "Freezing" categories into a single category. This consolidation ensures that the temperature classification remains meaningful and representative despite the scarcity of data points in the "Freezing" category
```{r}
temp_data <- temp_data %>%
  mutate(avg_temp_bine_jenks = ifelse(avg_temp_bine_jenks == "Freezing", "Very_Cold",
                                      as.character(avg_temp_bine_jenks)))
```

```{r}
table(temp_data$avg_temp_bine_jenks)
```

```{r}
# lets join this dataset to original dataset 
data<- data %>%
  left_join(temp_data%>%
  select(country,year,avg_temp_bine_jenks),by = c("year", "country"))
```

Now, let's compare the suicide ratio in each climate category.

```{r}
data_grouped <- data %>%
  group_by(avg_temp_bine_jenks) %>%
  summarise(
  suicide_ratio = mean(suicide_ratio) )
  
ggplot(data_grouped, aes(x = avg_temp_bine_jenks, y = suicide_ratio, fill = avg_temp_bine_jenks)) +
  geom_bar(stat = "identity", color = "black") +
  labs(x = "Average temp Bin", y = "Suicide Ratio", title = "Suicide Ratio for each Climate Category") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none") +
  scale_fill_brewer(palette = "Set2")
```
From the plots, it is evident that countries with hot and warm climates exhibit significantly lower suicide rates. Moreover, the plot suggests that we can simplify the climate categories into just two groups: "Warm" and "Not Warm" since the suicide rates in these two categories are similar.
```{r}
data <- data %>%
  mutate(avg_temp_bine_jenks = ifelse(avg_temp_bine_jenks == "Hot", "Warm",
                                      as.character(avg_temp_bine_jenks)))
```

```{r}
data <- data %>%
  mutate(avg_temp_bine_jenks = ifelse(avg_temp_bine_jenks == "Very_Cold", "Cold",
                                      as.character(avg_temp_bine_jenks)))
```

We can apply the same procedure of analysis to the variables of maximum temperature and mean temperature as well.

```{r}
# MIN TEMP
temp_data <- data %>%
  group_by(year, country) %>%
  summarize(temp = mean(min_temp), .groups = "drop")

fisher_jenks <- function(x) {
  bins <- classIntervals(x, n = 5, style = "fisher")$brks
  cut(x, breaks = bins, labels = c("Freezing","Very_Cold", "Cold", "Warm","Hot" ), include.lowest = TRUE)
}

# Add the new column to the data frame
temp_data <- temp_data %>%
  group_by(year) %>%
  mutate(min_temp_bine_jenks = fisher_jenks(temp))

# since we have very low country for very_cold we unify very cold and cold also 
temp_data <- temp_data %>%
  mutate(min_temp_bine_jenks = ifelse(min_temp_bine_jenks == "Freezing", "Very_Cold",
                                      as.character(min_temp_bine_jenks)))

# Join to the data
data<- data %>%
  left_join(temp_data%>%
  select(country,year,min_temp_bine_jenks),by = c("year", "country"))

```

```{r}
data_grouped <- data %>%
  group_by(min_temp_bine_jenks) %>%
  summarise(
  suicide_ratio = mean(suicide_ratio) )

ggplot(data_grouped, aes(x = min_temp_bine_jenks, y = suicide_ratio, fill = min_temp_bine_jenks)) +
  geom_bar(stat = "identity", color = "black") +
  labs(x = "min temp Bin", y = "Suicide Ratio", title = "Suicide Ratio for each min temp Bin") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none") +
  scale_fill_brewer(palette = "Set2")
```

```{r}
data <- data %>%
  mutate(min_temp_bine_jenks = ifelse(min_temp_bine_jenks == "Warm", "Hot", as.character(min_temp_bine_jenks)))
```

```{r}
# Max Temp

temp_data <- data %>%
  group_by(year, country) %>%
  summarize(temp = mean(max_temp))


fisher_jenks <- function(x) {
  bins <- classIntervals(x, n = 5, style = "fisher")$brks
  cut(x, breaks = bins, labels = c("Freezing","Very_Cold", "Cold", "Warm","Hot" ), include.lowest = TRUE)
}

# Add the new column to the data frame
temp_data <- temp_data %>%
  group_by(year) %>%
  mutate(max_temp_bine_jenks = fisher_jenks(temp))

temp_data <- temp_data %>%
  mutate(max_temp_bine_jenks = ifelse(max_temp_bine_jenks == "Freezing", "Very_Cold",
                                      as.character(max_temp_bine_jenks)))

data<- data %>%
  left_join(temp_data%>%
  select(country,year,max_temp_bine_jenks),by = c("year", "country"))
```

```{r}
data_grouped <- data %>%
  group_by(max_temp_bine_jenks) %>%
  summarise(
  suicide_ratio = mean(suicide_ratio) )

ggplot(data_grouped, aes(x = max_temp_bine_jenks, y = suicide_ratio, fill = max_temp_bine_jenks)) +
  geom_bar(stat = "identity", color = "black") +
  labs(x = "max temp Bin", y = "Suicide Ratio", title = "Suicide Ratio for each max temp Bin") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none") +
  scale_fill_brewer(palette = "Set2")
  

```

```{r}
data <- data %>%
  mutate(max_temp_bine_jenks = ifelse(max_temp_bine_jenks == "Hot", "Warm", as.character(max_temp_bine_jenks)))
```

```{r}
data <- data %>%
  mutate(max_temp_bine_jenks = ifelse(max_temp_bine_jenks == "Very_Cold", "Cold", as.character(max_temp_bine_jenks)))
```

## 3.3 Social and Economical  

### 3.3.1 GDP

### 3.3.1.1 Univariate Analysis

```{r}
# Scale the GDP and log GDP variables
data$scaled_GDP_for_year <- (data$GDP_for_year - min(data$GDP_for_year))/(max(data$GDP_for_year)-min(data$GDP_for_year))
data$scaled_log_GDP_year <- (data$log_GDP_year - min(data$log_GDP_year)) / (max(data$log_GDP_year) - min(data$log_GDP_year))
```


```{r}
library(gridExtra)
# Plotting original GDP
p1 <- ggplot(data, aes(GDP_for_year)) +
  geom_histogram(binwidth=1000000000, fill="skyblue", color="black") +
  labs(title = "Histogram of GDP_for_year",
       x = "GDP_for_year",
       y = "Count")

p2 <- ggplot(data, aes(x = "", y = GDP_for_year)) +
  geom_boxplot(fill="lightgreen", color="black") +
  labs(title = "Box Plot of GDP_for_year",
       x = "",
       y = "GDP_for_year")

# Plotting scaled GDP
p3 <- ggplot(data, aes(scaled_GDP_for_year)) +
  geom_histogram(binwidth=0.01, fill="skyblue", color="black") +
  labs(title = "Histogram of scaled_GDP_for_year",
       x = "scaled_GDP_for_year",
       y = "Count")

p4 <- ggplot(data, aes(x = "", y = scaled_GDP_for_year)) +
  geom_boxplot(fill="lightgreen", color="black") +
  labs(title = "Box Plot of scaled_GDP_for_year",
       x = "",
       y = "scaled_GDP_for_year")

# Plotting scaled log GDP
p5 <- ggplot(data, aes(scaled_log_GDP_year)) +
  geom_histogram(binwidth=0.01, fill="skyblue", color="black") +
  labs(title = "Histogram of scaled_log_GDP_year",
       x = "scaled_log_GDP_year",
       y = "Count")

p6 <- ggplot(data, aes(x = "", y = scaled_log_GDP_year)) +
  geom_boxplot(fill="lightgreen", color="black") +
  labs(title = "Box Plot of scaled_log_GDP_year",
       x = "",
       y = "scaled_log_GDP_year")

# Arrange the plots in a grid
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 2)

```


```{r}
# Computing scaled and log-scaled GDP per capita
data$scaled_GDP_per_capita <- (data$GDP_per_capita - min(data$GDP_per_capita)) / 
                              (max(data$GDP_per_capita) - min(data$GDP_per_capita))

data$scaled_log_GDP_capita <- (data$log_GDP_capita-min(data$log_GDP_capita))/
                              (max(data$log_GDP_capita)-min(data$log_GDP_capita)) 
```


```{r}
# Loading required library
library(gridExtra)

# Original GDP per Capita
p1 <- ggplot(data, aes(GDP_per_capita)) +
  geom_histogram(binwidth=1000, fill="skyblue", color="black") +
  labs(title = "Histogram of GDP_per_Capita",
       x = "GDP_per_Capita",
       y = "Count")

p2 <- ggplot(data, aes(x = "", y = GDP_per_capita)) +
  geom_boxplot(fill="lightgreen", color="black") +
  labs(title = "Box Plot of GDP_per_Capita",
       x = "",
       y = "GDP_per_Capita")

# Scaled GDP per Capita
p3 <- ggplot(data, aes(scaled_GDP_per_capita)) +
  geom_histogram(binwidth=0.01, fill="skyblue", color="black") +
  labs(title = "Histogram of Scaled GDP_per_Capita",
       x = "Scaled GDP_per_Capita",
       y = "Count")

p4 <- ggplot(data, aes(x = "", y = scaled_GDP_per_capita)) +
  geom_boxplot(fill="lightgreen", color="black") +
  labs(title = "Box Plot of Scaled GDP_per_Capita",
       x = "",
       y = "Scaled GDP_per_Capita")

# Scaled Log GDP per Capita
p5 <- ggplot(data, aes(scaled_log_GDP_capita)) +
  geom_histogram(binwidth=0.01, fill="skyblue", color="black") +
  labs(title = "Histogram of Scaled Log GDP_per_Capita",
       x = "Scaled Log GDP_per_Capita",
       y = "Count")

p6 <- ggplot(data, aes(x = "", y = scaled_log_GDP_capita)) +
  geom_boxplot(fill="lightgreen", color="black") +
  labs(title = "Box Plot of Scaled Log GDP_per_Capita",
       x = "",
       y = "Scaled Log GDP_per_Capita")

# Arrange the plots in a grid
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 2)

```

### 3.3.1.2 Bivariate Analysis

```{r}
# Prepare the data in long format
long_data <- data %>% 
  select(scaled_log_GDP_year, suicide_ratio, log_suicide_ratio, sqrt_suicide_ratio) %>%
  gather(key = "Variable", value = "Value", -scaled_log_GDP_year)

# Create the plot
ggplot(long_data, aes(x = scaled_log_GDP_year, y = Value)) +
  geom_point(color = "skyblue", size=0.5) +
  facet_wrap(~Variable, scales = "free", ncol = 3) +
  ggtitle("Scatterplots of scaled_log_GDP_year vs. different variables") +
  theme_minimal() +
  ylab("") +
  xlab("scaled_log_GDP_year")

```

```{r}
cor(data$scaled_log_GDP_year, data$suicide_ratio)
cor(data$scaled_log_GDP_year, data$log_suicide_ratio)
cor(data$scaled_log_GDP_year, data$sqrt_suicide_ratio)
```

```{r}
# Prepare the data in long format
long_data <- data %>% 
  select(scaled_log_GDP_capita, suicide_ratio, log_suicide_ratio, sqrt_suicide_ratio) %>%
  gather(key = "Variable", value = "Value", -scaled_log_GDP_capita)

# Create the plot
ggplot(long_data, aes(x = scaled_log_GDP_capita, y = Value)) +
  geom_point(color = "skyblue", size=0.5) +
  facet_wrap(~Variable, scales = "free", ncol = 3) +
  ggtitle("Scatterplots of scaled_log_GDP_year vs. different variables") +
  theme_minimal() +
  ylab("") +
  xlab("scaled_log_GDP_year")

```
```{r}
cor(data$scaled_log_GDP_capita, data$suicide_ratio)
cor(data$scaled_log_GDP_capita, data$log_suicide_no)
cor(data$scaled_log_GDP_capita, data$sqrt_suicide_ratio)
cor(data$GDP_per_capita, data$suicide_ratio)
cor(data$GDP_per_capita, data$log_suicide_no)
cor(data$GDP_per_capita, data$sqrt_suicide_ratio)
```
I am surprised to discover that there seems to be no evident influence between GDP and suicide rates. Let's further investigate this matter.

Considering the fact that GDP tends to increase over the years for countries, it becomes clear that GDP alone may not be a reliable indicator of a country's overall wealth or prosperity. 

To gain deeper insights, we can introduce a new column called "gpd_pro_cap," which represents the share of each individual within each cluster divided by the sum of the GDP values for all countries. This calculation provides an estimation of the share of each individual within each cluster relative to the total GDP of the world during those respective years.By incorporating this new measure, we aim to capture the average share of each individual within each cluster in each year, accounting for the global GDP. This approach allows us to evaluate the relative economic position of individuals within their respective clusters over time.

Let's analyze the correlation between the average GDP per capita and the years.
```{r}
# Compute average GDP_per_capita for each year
df_yearly <- data %>%
  group_by(year) %>%
  summarise(avg_GDP_per_capita = mean(GDP_per_capita, na.rm = TRUE))

correlation <- cor(df_yearly$year, df_yearly$avg_GDP_per_capita)
print(correlation)
```

To address this issue, we can categorize GDP per capita similar to how we categorized the population variable earlier.

```{r}
gdp_data <- data %>%
  group_by(year, country) %>%
  summarize(income = mean(GDP_per_capita, na.rm = TRUE), .groups = "drop")
```

```{r}
fisher_jenks <- function(x) {
  bins <- classIntervals(x, n = 4, style = "fisher")$brks
  cut(x, breaks = bins, labels = c("Very_Low_income","Low_income", "Medium_income", "High_income" ), include.lowest = TRUE)
}

# Add the new column to the data frame
gdp_data <- gdp_data %>%
  group_by(year) %>%
  mutate(gdp_per_capita_bine_jenks = fisher_jenks(income))

# View the new data
table(gdp_data$gdp_per_capita_bine_jenks)
```
```{r}
data<- data %>%
  left_join(gdp_data%>%
  select(country,year,gdp_per_capita_bine_jenks),by = c("year", "country"))
```

```{r}
data_sum <- data %>%
  group_by(country, year) %>%
  summarise(GDP_per_capita = mean(GDP_per_capita), .groups = "drop")

thresholds <- quantile(data_sum$GDP_per_capita, probs = c(0.25, 0.5, 0.75))

# Assign each country-year pair to a population category
data_binned <- data_sum %>%
  mutate(
    gdp_bine_median = case_when(
      GDP_per_capita <= thresholds[1] ~ "Very_low_income",
      GDP_per_capita > thresholds[1] & GDP_per_capita <= thresholds[2] ~ "low_income",
      GDP_per_capita > thresholds[2] & GDP_per_capita <= thresholds[3] ~ "Medium_income",
      TRUE ~ "high_income"
    )
  )

data <- data %>%
  left_join(data_binned%>%
  select(country,year,gdp_bine_median),by = c("year", "country"))
```
By using the GDP binning method (gdp_bine_jenks) for each year, we mitigate the impact of the increasing GDP over time. Now, let's examine the suicide ratio within each GDP category to gain further insights.

```{r}
# Calculate the mean suicide_ratio for each group
mean_suicide_ratio <- data %>%
  group_by(gdp_per_capita_bine_jenks) %>%
  summarise(mean_suicide_ratio = mean(suicide_ratio))

# Print the mean_suicide_ratio dataframe
print(mean_suicide_ratio)
```


```{r}
# Plot the mean_suicide_ratio
ggplot(mean_suicide_ratio, aes(x = gdp_per_capita_bine_jenks, y = mean_suicide_ratio, fill = gdp_per_capita_bine_jenks)) +
  geom_col(show.legend = FALSE) + # Remove the color legend
  scale_fill_brewer(palette = "Set2") + # Change the color palette
  labs(title = "Mean Suicide Ratio by GDP Group",
       x = "GDP Group (Jenks Natural Breaks)",
       y = "Mean Suicide Ratio") +
  theme_minimal() + # Use a clean theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability

```
It appears that countries with lower GDP per capita tend to exhibit lower suicide rates. This observation suggests a possible inverse relationship between the economic status of a country and its suicide rate.

### 3.3.1.3 Multivariate Analysis

In this section, we aim to explore the relationship between a country's economic prosperity and its suicide rate. Specifically, we investigate the question: "As a country gets richer, does its suicide rate decrease?"

It depends on the country - for almost every country, there is a high correlation between year and gdp per capita, i.e. as time goes on, gdp per capita linearly increases.

```{r}
country_year_gdp <- data %>%
  group_by(country, year) %>%
  summarize(GDP_per_capita = mean(GDP_per_capita), .groups = "drop")
  
country_year_gdp_corr <- country_year_gdp %>%
  ungroup() %>%
  group_by(country) %>%
  summarize(year_gdp_correlation = cor(year, GDP_per_capita), .groups = "drop")
```
In our analysis, we examined the relationship between 'year' and 'GDP per capita' within individual countries by calculating the Pearson correlations. The results were intriguing: the mean correlation was 0.878, indicating a very strong positive linear relationship. Essentially, this suggests that an increase in wealth per person within a country is correlated with an increase in the country's suicide rate over time.

However, it's crucial to note that these trends are not uniform across all countries. While some countries show an increase in suicide rates over time, most are actually experiencing a decrease.

This leads us to ask a slightly different but equally significant question: Do wealthier countries have higher suicide rates? To explore this, we calculated the mean GDP per capita across all available years for each country, then compared this with the average suicide rate over the same period. This approach provides us with a single data point for each country, offering a general impression of a nation's affluence and its suicide rate.
```{r}
country_mean_gdp <- data %>%
  group_by(country, continent) %>%
  summarize(suicide_per_100k = (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000, 
            gdp_per_capita = mean(GDP_per_capita), .groups = "drop")

ggplot(country_mean_gdp, aes(x = gdp_per_capita, y = suicide_per_100k, col = continent)) + 
  geom_point() + 
  scale_x_continuous(labels=scales::dollar_format(prefix="$"), breaks = seq(0, 70000, 10000)) + 
  labs(title = "Correlation between GDP (per capita) and Suicides per 100k", 
       subtitle = "Plot containing every country",
       x = "GDP (per capita)", 
       y = "Suicides per 100k", 
       col = "Continent") 
```
A number of countries in our dataset exhibit high leverage and residuals, potentially influencing the fit of our regression line. A notable example is Lithuania, which is situated in the top left of our graph. To mitigate this impact, we'll apply Cook's Distance as a measure to identify and exclude outliers. We will exclude those countries with a Cook's Distance value greater than 4/n, which is a common threshold.

After implementing this adjustment, we'll examine the revised model, now free of outliers, to better understand its statistical properties.

```{r}
model1 <- lm(suicide_per_100k ~ gdp_per_capita, data = country_mean_gdp)

gdp_suicide_no_outliers <- model1 %>%
  augment() %>%
  arrange(desc(.cooksd)) %>%
  filter(.cooksd < 4/nrow(.)) %>% # removes 5/93 countries
  inner_join(country_mean_gdp, by = c("suicide_per_100k", "gdp_per_capita")) %>%
  select(country, continent, gdp_per_capita, suicide_per_100k)

model2 <- lm(suicide_per_100k ~ gdp_per_capita, data = gdp_suicide_no_outliers)

summary(model2)
```
Based on our analysis, we cannot reject the null hypothesis, which suggests that there is no linear association between the suicide rate per 100,000 population and GDP per capita for each country. However, we anticipate that when we incorporate these variables with other factors, it may reveal a linear association. We will further explore this relationship in the upcoming model chapter.


Check if the new features made any problem in the dataset.
```{r}
unfactorized_vars <- function(df) {
  var_names <- names(df)
  unfactorized <- var_names[sapply(df, function(x) is.character(x) | is.integer(x))]
  return(unfactorized)
}

# Testing the function
unfactorized_vars(data)
```

lets factorize them 

```{r}
data$avg_temp_bine_jenks <- factor(data$avg_temp_bine_jenks, 
                   ordered = T, 
                   levels = c("Cold","Warm"))

data$min_temp_bine_jenks <- factor(data$min_temp_bine_jenks, 
                   ordered = T, 
                   levels = c("Very_Cold","Cold","Hot"))

data$max_temp_bine_jenks <- factor(data$max_temp_bine_jenks, 
                   ordered = T, 
                   levels = c("Cold","Warm"))

data$gdp_per_capita_bine_jenks <- factor(data$gdp_per_capita_bine_jenks, 
                   ordered = T, 
                   levels = c("Very_Low_income",
                              "Low_income", 
                              "Medium_income", 
                              "High_income"))
data$gdp_bine_median <- factor(data$gdp_bine_median, 
                   ordered = T, 
                   levels = c("Very_low_income",
                              "low_income", 
                              "Medium_income", 
                              "high_income"))

```


```{r}
null_percentage <- function(df) {
  # Calculates the percentage of null values in each column of a dataframe

  # Get the number of nulls in each column
  nulls <- sapply(df, function(x) sum(is.na(x)))

  # Calculate the percentage
  percentages <- nulls / nrow(df) * 100

  # Return the result as a data frame for easier viewing
  return(data.frame(Column = names(df), NullPercentage = percentages))
}

# Usage:
null_percentage(data)

```

## 3.4 Demographic Variables

For this specific part, we have data on age, generation, and sex variables. It is important to emphasize that our dataset is well-distributed among each sex and age group. Each sex and age bound is represented by a single row in our dataset, ensuring comprehensive coverage across different demographic categories.

### 3.3.1 Univariate Anaylysis

Given that our data is well-distributed across different sexes and age groups, we can proceed to visualize a bar plot for the generation variable. This will provide a visual representation of how the data is distributed among different generations.

```{r}
# Define common theme for all plots
common_theme <- theme_minimal() +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Bar plot for Generation
generation_bar_plot <- data %>%
  ggplot(aes(x = generation, fill = generation)) +
  geom_bar() +
  labs(title = "Bar Plot of Generation",
       x = "Generation",
       y = "Count") +
  scale_fill_brewer(palette = "Set2") +
  common_theme


# Arrange plots
grid.arrange(generation_bar_plot, ncol = 1)

```

### 3.3.2 Bivariate Analysis

```{r}
# Define the function
create_suicide_rate_plot <- function(group_var) {
  data %>%
    group_by(!!sym(group_var)) %>%
    summarize(suicide_per_100k = (sum(suicides_no) / sum(population)) * 100000) %>%
    ggplot(aes_string(x = group_var, y = "suicide_per_100k", fill = group_var)) +
    geom_col() +
    labs(
      #title = group_var, 
      x = group_var,
      y = ""
    ) +
    theme_minimal() +
    theme(
      legend.position = "none",
      plot.title = element_text(hjust = 1),
      axis.text.x = element_text(angle = 0, hjust = 0.5, vjust = 1, size= 4),
      axis.line.x = element_line(inherit.blank = TRUE)
    ) +
    coord_cartesian(ylim = c(0, 30)) + 
    scale_fill_brewer(palette = "Set2")
}

theme_update(plot.title = element_blank(), axis.title.y = element_blank())

sex_plot <- create_suicide_rate_plot("sex")
age_plot <- create_suicide_rate_plot("age")
generation_plot <- create_suicide_rate_plot("generation")

# Arrange the plots
grid.arrange(
  top = textGrob("Global suicides per 100k", gp=gpar(fontsize=16, fontface="bold")),
  left = textGrob("Suicides per 100k", rot=90, gp=gpar(fontsize=16, fontface="bold")),
  arrangeGrob(sex_plot, age_plot, generation_plot, ncol=3)
)

```
Based on the plots, we observe that suicides are more prevalent among men and the age group of 75 years and older. Regarding the generation variable, it appears that suicides were more common in the G.I. Generation (also known as the World War II generation). However, it is important to note that our data is not evenly distributed among the different generations, with limited data available for the G.I. Generation and Millennials. This necessitates further investigation to draw more reliable conclusions.

To validate the insights from the plots and determine their statistical significance, we will employ statistical tests. These tests will help assess if the observed patterns are statistically significant or merely due to random variation.


#### 3.3.2.1 T_test for Sex

To determine if the assumption of Homogeneity of Variance is satisfied, we can employ the Levene's test. This statistical test allows us to assess if the variances are equal across the different groups under consideration. By conducting the Levene's test, we can evaluate if the Homogeneity of Variance assumption holds true in our data.
```{r}
leveneTest(log_suicide_ratio ~ sex, data = data)
```

```{r}
t.test(log_suicide_ratio ~ sex, data = data, var.equal = FALSE)
```
The test results indicate that there is a statistically significant difference between males and females. This finding suggests that the suicide rates significantly vary between the two genders.

#### 3.3.2.2 ANOVA for Age

Since we have multiple age groups to compare, we can employ the ANOVA (Analysis of Variance) test. The hypothesis for the ANOVA test is as follows:

H0: The mean suicide ratio is equal for all age groups.
H1: There is at least one age group with a different mean suicide ratio.

By conducting the ANOVA test, we can determine if there is a statistically significant difference in the mean suicide ratios among the various age groups.

```{r}
# Fit the model
age_anova <- aov(log_suicide_ratio ~ age, data = data)

# Run the ANOVA
anova_result <- anova(age_anova)

# Print the result
print(anova_result)
```
Testing Normality of Residuals Assumption for ANOVA
```{r}
# Create a data frame for residuals
residuals_df <- data.frame(residuals = residuals(age_anova))

# Create histogram of residuals
hist_plot <- ggplot(residuals_df, aes(x = residuals)) +
  geom_histogram(fill = 'steelblue', color = 'black', bins = 30) +
  theme_minimal() +
  labs(x = "Residuals", y = "Frequency",
       title = "Histogram of Residuals")

# Create Q-Q plot of residuals
qq_plot <- ggplot(residuals_df, aes(sample = residuals)) +
  geom_qq(color = 'steelblue') +
  geom_qq_line(color = 'red') +
  theme_minimal() +
  labs(title = "Normal Q-Q Plot",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles")

# Arrange the plots side by side using the gridExtra package
library(gridExtra)
grid.arrange(hist_plot, qq_plot, ncol = 2)
```

Testing Homogeneity of Variances Assumption for ANOVA
```{r}
leveneTest(log_suicide_ratio ~ age, data = data)
```
```{r}
bartlett.test(log_suicide_ratio ~ age, data = data)
```

Since the p-value from both tests is small, we reject the null hypothesis, indicating that the variances are not equal across different groups. In this scenario, using the ANOVA test may not provide accurate results. As an alternative, we can employ the Kruskal-Wallis test, which is a non-parametric test suitable for situations where the assumption of equal variances is violated.

```{r}
kruskal.test(log_suicide_ratio ~ age, data = data)
```
The result of the Kruskal-Wallis test aligns with that of the ANOVA. The obtained p-value is significantly small, indicating that there is a statistically significant difference in the means of the target variable across the levels of the categorical variable. However, the ANOVA alone does not provide information about which specific groups have different means.

To identify the specific groups with significant mean differences, we can employ Tukey's Honest Significant Difference (HSD) test. This post-hoc test allows us to conduct pairwise comparisons and determine which groups exhibit statistically significant differences in their means. By performing further investigations using the Tukey's HSD test, we can gain more insights into the specific group differences.
```{r}
TukeyHSD(age_anova)
```

#### 3.3.2.3 ANOVA for Generation
```{r}
# Fit the model
generation_anova <- aov(log_suicide_ratio ~ generation, data = data)

# Run the ANOVA
anova_result <- anova(generation_anova)

# Print the result
print(anova_result)
```
```{r}
leveneTest(log_suicide_ratio ~ generation, data = data)
```
```{r}
kruskal.test(log_suicide_ratio ~ generation, data = data)
```

The obtained small p-values indicate that there is a statistically significant difference in the suicide rate among different generations. This finding suggests that the suicide rates vary significantly across the different generational cohorts.

### 3.3.3 Multivariate Analysis

#### 3.3.3.1 Trends Over Time

##### 3.3.3.1.1 Sex
```{r}
sex_time_plot <- data %>%
  group_by(year, sex) %>%
  summarize(suicide_per_100k = (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000,
            .groups = "drop") %>%
  ggplot(aes(x = year, y = suicide_per_100k, col = factor(sex))) + 
  facet_grid(sex ~ ., scales = "free_y") + 
  geom_line() + 
  geom_point() + 
  labs(title = "Trends Over Time, by Sex", 
       x = "Year", 
       y = "Suicides per 100k", 
       color = "Sex") + 
  theme(legend.position = "none") + 
  scale_x_continuous(breaks = seq(1985, 2015, 5), minor_breaks = F)

grid.arrange(sex_time_plot, ncol = 1)
```
Globally, the suicide rate for men has been approximately 3.5 times higher compared to women. Both male and female suicide rates reached their peak in 1995 and have been declining since then. It is noteworthy that the ratio of male to female suicide rates, which stands at 3.5 : 1, has remained relatively consistent since the mid-1990s. However, it is important to mention that during the 1980s, this ratio was comparatively lower.

##### 3.3.3.1.2 Age

```{r}
age_time_plot <- data %>%
  group_by(year, age) %>%
  summarize(suicide_per_100k = (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000,.groups = "drop") %>%
  ggplot(aes(x = year, y = suicide_per_100k, col = age)) + 
  facet_grid(age ~ ., scales = "free_y") + 
  geom_line() + 
  geom_point() + 
  labs(title = "Trends Over Time, by Age", 
       x = "Year", 
       y = "Suicides per 100k", 
       color = "Age") + 
  theme(legend.position = "none") + 
  scale_x_continuous(breaks = seq(1985, 2015, 5), minor_breaks = FALSE)

grid.arrange(age_time_plot, ncol = 1)
```

Globally, there is an increased likelihood of suicide as age advances. Since 1995, the suicide rate has been consistently decreasing for individuals across all age groups. Notably, the suicide rate for those aged 75 and above has witnessed a significant decline of over 50% since 1990. These trends reflect positive progress in addressing and reducing suicide rates among different age demographics over the past few decades.


##### 3.3.3.1.3 Generation

When dealing with continuous data, such as someone's age in a given year, it is commonly assumed that their age determines their generation. However, it is important to note that in our dataset, not everyone within the same age group in a specific year belongs to the same generation.

```{r}
# Create the plot
data %>%
  group_by(generation, age, year) %>%
  summarize(suicide_per_100k = (sum(suicides_no) / sum(population)) * 100000, .groups = 'drop') %>%
  ggplot(aes(x = year, y = suicide_per_100k, color = generation)) + 
  geom_point() + 
  geom_line() + 
  scale_color_brewer(palette = "Set2") +
  facet_grid(age ~ ., scales = "free_y") + 
  scale_x_continuous(breaks = seq(1985, 2020, 5)) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Relationship between Generation, Age & Year", 
       x = "Year", 
       y = "Suicides per 100k", 
       color = "Generation") + 
  theme(
    legend.position = "right"
  )

```
Understanding the trend of generation suicide rates over time becomes problematic due to overlapping age categories. When comparing the rates below with the plotted data, we notice that large spikes occur when different age groups are classified as part of a certain generation or not. For instance, in 1991, there is a supposed spike in the suicide rate for the G.I. generation. However, this spike occurs because individuals aged '55 - 75' are suddenly excluded from this generation classification.

```{r}
# Define a common theme

common_theme <- 
  theme(
    legend.position = "none",
    #strip.background = element_blank(),
    strip.text.x = element_text(size = 10, color = "black"),
    strip.text.y = element_text(size = 5, color = "white"),
    panel.background = element_rect(fill = "white", color = "black")
    #panel.grid.major = element_line(color = "grey80"),
    #panel.grid.minor = element_line(color = "grey90")
  )

# Create the suicide rate plot
generation_rate <- data %>%
  group_by(generation, year) %>%
  summarize(suicide_per_100k = (sum(suicides_no) / sum(population)) * 100000, .groups = 'drop') %>%
  ggplot(aes(x = year, y = suicide_per_100k, color = generation)) + 
  geom_point(size = 1.5, alpha = 0.8) + 
  geom_line(alpha = 0.6, linewidth= 1) + 
  facet_grid(generation ~ ., scales = "free_y") + 
  scale_x_continuous(breaks = seq(1985, 2020, 5)) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Suicides per 100k, by Generation", 
       x = "Year", 
       y = "Suicides per 100k") +
  common_theme

# Create the population plot
generation_population <- data %>%
  group_by(generation, year) %>%
  summarize(population = sum(population), .groups = 'drop') %>%
  ggplot(aes(x = year, y = population / 1000000, color = generation)) + 
  geom_point(size = 1.5, alpha = 0.8) + 
  geom_line(alpha = 0.6, linewidth= 1) + 
  facet_grid(generation ~ ., scales = "free_y") + 
  scale_x_continuous(breaks = seq(1985, 2020, 5)) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Population, by Generation", 
       x = "Year", 
       y = "Population (Millions)") +
  common_theme

# Arrange the plots
grid.arrange(generation_rate, generation_population, ncol = 2)

```
The issue at hand likely stems from the methodology used to create the dataset. It appears that the generation variable was added after summarizing the data by country, year, age, and sex, which is problematic. In reality, not everyone within a specific age group and year can be accurately assigned to a single generation.

As a result, the observed "spikes" in generation across time lack meaningful interpretation. Consequently, we cannot draw any conclusive conclusions regarding the suicide rates among different generations based on this dataset.


#### 3.3.3.2 Age differences, by Continent

```{r}
global_average <- (sum(as.numeric(data$suicides_no)) / sum(as.numeric(data$population))) * 100000

data %>%
  group_by(continent, age) %>%
  summarize(n = n(), 
            suicides = sum(as.numeric(suicides_no)), 
            population = sum(as.numeric(population)), 
            suicide_per_100k = (suicides / population) * 100000, .groups= "drop") %>%
  ggplot(aes(x = continent, y = suicide_per_100k, fill = age)) + 
  geom_bar(stat = "identity", position = "dodge") + 
  geom_hline(yintercept = global_average, linetype = 2, color = "grey35", size = 1) +
  labs(title = "Age Disparity, by Continent",
       x = "Continent", 
       y = "Suicides per 100k", 
       fill = "Age")+
  coord_flip()
```
In the regions of the Americas, Asia, and Europe, which comprise the majority of the dataset, the suicide rate tends to increase with age. However, it is important to note that for Oceania and Africa, the highest suicide rates are observed among individuals aged 25 to 34. Nevertheless, due to the limited availability of data for Africa, this particular finding may not be entirely reliable. Further investigation and data collection are necessary to provide more accurate insights into the suicide rates in Africa.

#### 3.3.3.3 Gender differences, by Continent

```{r}
data %>%
  group_by(continent, sex) %>%
  summarize(n = n(), 
            suicides = sum(as.numeric(suicides_no)), 
            population = sum(as.numeric(population)), 
            suicide_per_100k = (suicides / population) * 100000,
            .groups = "drop") %>%
  ggplot(aes(x = continent, y = suicide_per_100k, fill = sex)) + 
  geom_bar(stat = "identity", position = "dodge") + 
  geom_hline(yintercept = global_average, linetype = 2, color = "grey35", size = 1) +
  labs(title = "Gender Disparity, by Continent",
   x = "Continent", 
   y = "Suicides per 100k", 
   fill = "Sex") +
  coord_flip()
```
Between 1985 and 2015, European men faced the highest risk of suicide, with a rate of approximately 30 suicides per 100,000 population per year. In comparison, Asia had the lowest overrepresentation of male suicide, with the suicide rate for men being around 2.5 times higher than that for women. Conversely, in Europe, the male suicide rate was approximately 3.9 times higher than the female suicide rate, indicating a greater disparity between genders in suicide rates compared to Asia.

#### 3.3.3.4 Gender differences, by Country

```{r}
# Overall suicide rate by country and continent
country_long <- data %>%
  group_by(country, continent) %>%
  summarize(suicide_per_100k = (sum(suicides_no, na.rm = TRUE) / sum(population, na.rm = TRUE)) * 1e5, .groups = "drop") %>%
  mutate(sex = "OVERALL")

# Suicide rate by country, continent, and sex
sex_country_long <- data %>%
  group_by(country, continent, sex) %>%
  summarize(suicide_per_100k = (sum(suicides_no, na.rm = TRUE) / sum(population, na.rm = TRUE)) * 1e5, .groups = "drop")

# Pivot the data to wide format for visualization, and calculate the difference between Male and Female suicide rates
sex_country_wide <- sex_country_long %>%
  pivot_wider(names_from = sex, values_from = suicide_per_100k) %>%
  arrange(Male - Female)

# Convert 'country' to ordered factor based on difference in suicide rates between Male and Female
ordered_countries <- sex_country_wide$country
sex_country_wide$country <- factor(sex_country_wide$country, ordered = TRUE, levels = ordered_countries)
sex_country_long$country <- factor(sex_country_long$country, ordered = TRUE, levels = ordered_countries)

# Visualization
ggplot(sex_country_wide, aes(y = country, color = sex)) + 
  geom_dumbbell(aes(x=Female, xend=Male), color = "grey", size = 0.5) + 
  geom_point(data = sex_country_long, aes(x = suicide_per_100k), size = 0.5) +
  geom_point(data = country_long, aes(x = suicide_per_100k), size = 0.5) + 
  geom_vline(xintercept = global_average, linetype = 2, color = "grey35", linewidth = 0.5) +
  theme(axis.text.y = element_text(size = 1), legend.position = c(0.85, 0.2)) + 
  scale_x_continuous(breaks = seq(0, round(max(sex_country_wide$Male, na.rm = TRUE) + 10, -1), 10)) +
  labs(title = "Gender Disparity, by Continent & Country", 
       subtitle = "Ordered by difference in deaths per 100k.", 
       x = "Suicides per 100k", 
       y = "Country", 
       color = "Sex")

```

```{r}
country_gender_prop <- sex_country_wide %>%
  mutate(Male_Proportion = Male / (Female + Male)) %>%
  arrange(Male_Proportion)

sex_country_long$country <- factor(sex_country_long$country, 
                                   ordered = T,
                                   levels = country_gender_prop$country)

ggplot(sex_country_long, aes(y = suicide_per_100k, x = country, fill = sex)) + 
  geom_bar(position = "fill", stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Proportions of suicides that are Male & Female, by Country", 
       x = "Country", 
       y = "Suicides per 100k",
       fill = "Sex") + 
  coord_flip() +
  theme(
  legend.position = "right",
  legend.key.size = unit(0.25, "cm"),
  plot.title = element_text(hjust = 0.5),
  axis.text.y = element_text(size= 4))
```
The over representation of men in suicide deaths is a widespread phenomenon observed in various countries. Although women may have higher rates of depression and suicidal thoughts, it is men who are more likely to die by suicide. This paradoxical pattern, known as the gender paradox in suicidal behavior, highlights the complex interplay of factors such as societal expectations, help-seeking behaviors, and coping mechanisms that contribute to the gender disparity in suicide rates. It underscores the need for further research and targeted interventions to address this issue and reduce the burden of suicide among both men and women. [links](https://en.wikipedia.org/wiki/Gender_differences_in_suicide)


# 4.Model
## 4.1 modification feutures 
In this chapter, we begin with a comprehensive overview of our variables, identifying their characteristics and potential areas for refinement. We subsequently make necessary adjustments to improve their suitability for our analysis

```{r}
data1 <-data.frame(data)
```
In the initial phase of our analysis, we focus on refining our dataset for more accurate and meaningful results. Specifically, we remove certain columns that are not contributing to our understanding or prediction of the suicide ratio.

As we've discussed earlier, variables such as 'population', 'suicide_no', and their related transformations or scaled versions inherently have a strong association with our target, the 'suicide_ratio'. The 'suicide_ratio' is an estimate of the likelihood of an individual committing suicide in a specific demographic group or country.

While it might seem that 'population' would be a beneficial predictor for the 'suicide_ratio', including it may skew our results, leading to biased estimations. This is because it's not the mere size of the population, but specific characteristics within that population that lead to increased suicide ratios.

To navigate this challenge, we incorporate different features to transform and convey the crucial information contained in these variables, without directly using them. This way, we aim to create a model that captures the nuances and complexity of the factors contributing to the suicide ratio.
```{r}
remove_var = c("sqrt_suicide_no","population","log_population","new_suicides_no","new_suicide_ratio","scaled_population","scaled_log_population","scaled_GDP_for_year","scaled_log_GDP_year","log_suicide_no","scaled_log_GDP_capita","sqrt_population","sqrt_suicide_no","suicides_no" )
data1 <- dplyr::select(data1, -dplyr::one_of(remove_var))

```
In the section dedicated to outlier analysis, we observed that the log-transformed version of 'suicide_ratio' is significantly more resilient to outliers than the original 'suicide_ratio' variable. This discovery makes 'log(suicide_ratio)' a preferred candidate for our target variable, especially given its normal distribution which is a desirable property for many statistical models.

To prepare our dataset for further modeling, we standardize our predictors by rescaling them to have a mean of zero and a standard deviation of one.Finally, to maintain a tidy dataset, we drop the original untransformed and unscaled columns. This leaves us with a clean, standardized dataset that is ready for the next stages of our analysis and modeling process.
```{r}
scale_columns <- function(data, columns_to_scale) {
  # Loop over the columns
  for (col in columns_to_scale) {
    # Check if the column exists and is not all NA
    if (!col %in% names(data) || all(is.na(data[[col]]))) {
      message(paste("Column", col, "does not exist or is all NA. Skipping..."))
      next
    }
    # Create a new column name
    new_col_name <- paste0("scaled_", col)
    
    # Scale the column
    data[[new_col_name]] <- scale(data[[col]])
  }
  
  # Drop the original columns
  data <- data[, !(names(data) %in% columns_to_scale)]
  
  return(data)
}

```

```{r}
scaled_var = c("GDP_for_year","GDP_per_capita","life_exp","avg_temp","min_temp","max_temp","log_GDP_year","log_GDP_capita","sqrt_GDP_year","sqrt_GDP_capita","log_suicide_ratio","suicide_ratio","sqrt_suicide_ratio")
data2 <- scale_columns(data1,scaled_var)

```
## 4.2 Multicollinearity 
### 4.2.1 Continuous Variable 
In this section, we employ techniques such as heatmaps and Variance Inflation Factor (VIF) to investigate potential collinearity among our variables.

A heatmap is a valuable visualization tool that illustrates the correlation matrix through a gradient color scheme. By visually representing the correlation coefficients, a heatmap can reveal patterns and relationships among variables, highlighting any potential multicollinearity issues.

On the other hand, VIF is a numerical measure that quantifies the severity of multicollinearity in a regression analysis. It gauges the amount of multicollinearity by examining how much the variance of the estimated regression coefficients is increased due to multicollinearity. A high VIF suggests a high degree of collinearity with other variables, warranting attention.

These techniques collectively give us a holistic view of the correlation structure among our variables, aiding in feature selection and model performance improvement.

```{r}
# we create a new dataframe which only includes numeric columns using sapply
numeric_data <- data2[sapply(data2, is.numeric)]


width <- 25
height <- 25
options(repr.plot.width = width, repr.plot.height = height)


corr_matrix <- cor(numeric_data)

# Round the correlation matrix to 3 decimal places
rounded_corr <- round(corr_matrix, 3)

# Create the correlation plot
ggcorrplot(rounded_corr, 
           lab = TRUE, 
           lab_size = 1.5, 
           method = "circle", 
           pch = 1, 
           colors = c("red", "#ebebeb", "#13527a")) +
  theme(axis.text.x = element_text(size = 10))

```
there is strong corrilation between some of the variables

##### VIF

```{r}
remove_var <- c("scaled_sqrt_suicide_ratio","scaled_log_suicide_ratio")
data_simple <-dplyr::select(numeric_data, -dplyr::one_of(remove_var))
remove_var <- c("scaled_sqrt_suicide_ratio","scaled_suicide_ratio")
data_log <-dplyr::select(numeric_data, -dplyr::one_of(remove_var))
remove_var <- c("scaled_log_suicide_ratio","scaled_suicide_ratio")
data_sqrt <-dplyr::select(numeric_data, -dplyr::one_of(remove_var))

```
and for each target we make a vif graph
```{r}
mod.linear <- lm(scaled_suicide_ratio~ ., data = data_simple)
vifs <- data.frame(vif(mod.linear))

ggplot(vifs, aes(y=vif.mod.linear., x=row.names(vifs))) + 
    geom_bar(aes(fill=vif.mod.linear.>5),stat="identity")+
    scale_y_continuous(trans = "sqrt",  breaks = c(5, 10, 50, 100))+
    geom_hline(yintercept = 5, colour = "red") + 
    ggtitle("VIF per feature for suicide_ratio as target") +
    xlab("Featurs") + ylab("VIF") +
    theme(axis.text.x=element_text(angle=20, hjust=1))+
    theme(text = element_text(size = 10))+
    scale_fill_brewer(palette="RdYlBu")
```
As observable from our analysis, numerous variables exhibit high VIF, a sign of multicollinearity. This isn't surprising given that many variables were derived from one another through transformations. To handle this, we need to employ a strategy of variable clustering. Each cluster would contain variables that are highly correlated with one another, indicating potential multicollinearity.

From each cluster, we would then select the variable that strikes the best balance between having the highest correlation with our target and the least susceptibility to outliers. This approach enables us to maintain essential information while mitigating the negative effects of multicollinearity, thereby enhancing our model's predictive performance.
```{r}
gdp_var <-c("scaled_GDP_for_year","scaled_GDP_per_capita","scaled_log_GDP_year", "scaled_log_GDP_capita","scaled_sqrt_GDP_year","scaled_sqrt_GDP_capita")
temp_var<-c("scaled_min_temp","scaled_avg_temp","scaled_max_temp")
```
for suicide_ratio scaled_min_temp and scaled_GPD_per_year
lets do this test again 
```{r}
mod.linear <- lm(scaled_suicide_ratio~ ., data = subset((data_simple),select = c(year,scaled_life_exp,scaled_min_temp,scaled_GDP_for_year,scaled_suicide_ratio)))
vifs <- data.frame(vif(mod.linear))
ggplot(vifs, aes(y=vif.mod.linear., x=row.names(vifs))) + 
    geom_bar(aes(fill=vif.mod.linear.>5),stat="identity")+
    scale_y_continuous(trans = "sqrt",  breaks = c(5, 10, 50, 100))+
    geom_hline(yintercept = 5, colour = "red") + 
    ggtitle("VIF per feature for suicide_ratio as target") +
    xlab("Featurs") + ylab("VIF") +
    theme(axis.text.x=element_text(angle=20, hjust=1))+
    theme(text = element_text(size = 10))+
    scale_fill_brewer(palette="RdYlBu")
```
we do same for log_suicide_ratio and sqrt_suicide_ratio 
for log_suicide_ratio scaled_avg_temp and scaled_log_GDp_capita
suicide_ratio scaled_min_temp and scaled_GPD_per_year
lets do this test again 
```{r}
mod.linear <- lm(scaled_log_suicide_ratio~ ., data = subset((data_log),select = c(year,scaled_life_exp,scaled_avg_temp,scaled_log_GDP_capita,scaled_log_suicide_ratio)))
vifs <- data.frame(vif(mod.linear))

ggplot(vifs, aes(y=vif.mod.linear., x=row.names(vifs))) + 
    geom_bar(aes(fill=vif.mod.linear.>5),stat="identity")+
    scale_y_continuous(trans = "sqrt",  breaks = c(5, 10, 50, 100))+
    geom_hline(yintercept = 5, colour = "red") + 
    ggtitle("VIF per feature for suicide_ratio as target") +
    xlab("Featurs") + ylab("VIF") +
    theme(axis.text.x=element_text(angle=20, hjust=1))+
    theme(text = element_text(size = 10))+
    scale_fill_brewer(palette="RdYlBu")
```

suicide_ratio scaled_min_temp and scaled_GPD_per_year
for log_suicide_ratio scaled_avg_temp and scaled_log_GDp_capita
for sqrt_suicide_ratio scaled_avg_temp and scaled_log_GDp_capita
lets do this test again 
```{r}
mod.linear <- lm(scaled_sqrt_suicide_ratio~ ., data = subset((data_sqrt),select = c(year,scaled_life_exp,scaled_log_GDP_year,scaled_avg_temp,scaled_sqrt_suicide_ratio)))
vifs <- data.frame(vif(mod.linear))

ggplot(vifs, aes(y=vif.mod.linear., x=row.names(vifs))) + 
    geom_bar(aes(fill=vif.mod.linear.>5),stat="identity")+
    scale_y_continuous(trans = "sqrt",  breaks = c(5, 10, 50, 100))+
    geom_hline(yintercept = 5, colour = "red") + 
    ggtitle("VIF per feature for suicide_ratio as target") +
    xlab("Featurs") + ylab("VIF") +
    theme(axis.text.x=element_text(angle=20, hjust=1))+
    theme(text = element_text(size = 10))+
    scale_fill_brewer(palette="RdYlBu")
```

as we can see all of them have value less than 5 and we can say that there is no coliniarity between these variable.
### 4.2.2 Categorical variable 
The concept of multicollinearity is a bit less straightforward when applied to categorical variables, particularly because categorical variables can take on limited, and usually few, distinct values.

However, multicollinearity can still occur with categorical variables. For example, suppose you have a dataset of cars, and you have two variables: "Brand" and "Country". If every "Brand" uniquely maps to a "Country" (e.g., if 'Toyota' is always 'Japan', 'Ford' is always 'USA', etc.), then these two variables are perfectly multicollinear.
we can use chi squre but chisq is very sensetive to unbalanced variable.
we will ues Cramér's V for categorical variables. 
Cramér's V is a statistical measure that assesses the strength of association between two nominal variables. It is based on Pearson's chi-squared statistic and was published by Harald Cramér in 1946.
  
Cramér's V ranges from 0 (indicating no association between the variables) to 1 (indicating a perfect association). It could be seen as an extension of the correlation coefficient to nominal data.

Cramér's V is symmetrical — it does not matter which variable we consider as independent or dependent. The formula for Cramér's V is:

V = sqrt((X^2/n) / (min(k-1, r-1)))

where:

X^2 is the chi-squared statistic,
n is the total sample size,
k is the number of columns,
r is the number of rows in the contingency table.
Just like with correlation, a value close to 0 indicates little association between the variables, and a value close to 1 indicates a strong association. However, unlike correlation, Cramér's V can only reach 1 in the case of complete association (all cells other than the diagonal are 0), or when the number of rows equals the number of columns.
first seperate categorical var
```{r}
factor_vars <- sapply(data, is.factor)

factor_vars_names <- names(data)[factor_vars]
```
```{r}
factor_vars_names
```
then we apply Cramér's V for each pair of this variable. 

```{r}
# Retrieve all the categorical variable names
factor_vars_names <- names(data[sapply(data, is.factor)])

# Initialize a data frame to hold the Cramer's V values
V_df <- data.frame(matrix(nrow = length(factor_vars_names), ncol = length(factor_vars_names)))
names(V_df) <- factor_vars_names
rownames(V_df) <- factor_vars_names

# Loop over each pair of variables
for(i in 1:length(factor_vars_names)){
  for(j in 1:length(factor_vars_names)){
    if(i != j){
      
      # Create a contingency table
      tab <- table(data[[factor_vars_names[i]]], data[[factor_vars_names[j]]])
      
      # Perform Chi-square test
      chi_sq <- chisq.test(tab)
      
      # Calculate Cramer's V
      n <- sum(tab) # total number of observations
      k <- min(dim(tab)) # number of rows or columns (whichever is smaller)
      V <- sqrt(chi_sq$statistic / (n * (k - 1)))
      
      V_df[i,j] <- V
      
      cat("Cramer's V for", factor_vars_names[i], "and", factor_vars_names[j], ":", V, "\n")
      
    } else {
      V_df[i,j] <- NA
    }
  }
}

# Replace NA values with 0
V_df[is.na(V_df)] <- 0


print(V_df)

```

```{r}
#install.packages("pheatmap")

```
```{r}
library(pheatmap)

# Make the heatmap
pheatmap(V_df, color = colorRampPalette(c("navy", "white", "firebrick3"))(25))
```
As observed, the 'country' variable demonstrates significant associations with numerous variables. This is expected given that these variables were created via a 'group_by' operation on 'country'.

However, the crucial observation is the substantial association among 'avg_temp_bine_jenks', 'min_temp_bine_jenks', and 'max_temp_bine_jenks'. For model efficiency, we should select one from this set.

To guide this selection, we ran several linear models to evaluate compatibility between these temperature variables and our potential targets ('suicide_ratio', 'log_suicide_ratio', and 'sqrt_suicide_ratio').

We compiled a dataframe featuring our three targets and the three temperature variables. The dataframe entries represent the adjusted R-squared values for each corresponding pair, providing a basis for optimal feature selection.
```{r}
targets <- c("scaled_suicide_ratio", "scaled_log_suicide_ratio", "scaled_sqrt_suicide_ratio")
variables <- c("avg_temp_bine_jenks","min_temp_bine_jenks", "max_temp_bine_jenks"
)

adjusted_r2 <- matrix(nrow = length(targets), ncol = length(variables))
rownames(adjusted_r2) <- targets
colnames(adjusted_r2) <- variables

# loop over each target and variable
for (target in targets) {
  for (var in variables) {
    
    formula <- as.formula(paste(target, var, sep = " ~ "))
    
    # fit the linear model
    model <- lm(formula, data = data2)
    
    adjusted_r2[target, var] <- summary(model)$adj.r.squared
  }
}

# convert the matrix to a data frame
adjusted_r2_df <- as.data.frame(adjusted_r2)


print(adjusted_r2_df)
```
Considering the three potential target variables, 'min_temp_bine_jenks' consistently shows better performance in terms of R-squared values.

Thus far, we have categorized our features into continuous and categorical candidates.

Our final feature candidates, as determined by their collinearity and correlation with the target variable, are as follows:

```{r}
scaled_suicide_ratio_var <-c("year","country","sex","age","continent","population_bine_jenks","scaled_GDP_for_year","min_temp_bine_jenks","gdp_per_capita_bine_jenks","scaled_min_temp")
scaled_log_suicide_ratio_var <-c("year","country","sex","age","continent","population_bine_jenks","scaled_log_GDP_capita","min_temp_bine_jenks","gdp_per_capita_bine_jenks","scaled_avg_temp")
scaled_sqrt_suicide_ratio_var <-c("year","country","sex","age","continent","population_bine_jenks","scaled_log_GDP_capita" ,"min_temp_bine_jenks","gdp_per_capita_bine_jenks","scaled_avg_temp")

```
## 4.3 Models selection
so for we have 3 target variable and for each one we found different proper variable. in this section we inspect different models with different criteria 

To apply linear regression we need to make sure that four conditions are satisfied:

1.No multicollinearity: no high correlation between the independent variables;
2.Linearity: there must be a linear relationship between the target variablesand the other variables;
3.Normality: the residuals must be normally distributed;
4.Homoscedasticity: the residuals must have a constant variance

in previoues section we inspect multiliniarity problem and gave proper solution for each targts
lets first make a simple model for each variable and see which conditions will meet.
```{r}
#suicide_ratio
formula <- as.formula(paste("scaled_suicide_ratio", "~", paste(scaled_suicide_ratio_var, collapse = " + ")))

model_suicide_ratio <- lm(formula, data = data2)
summary(model_suicide_ratio)
```

```{r}
data2 <- data2%>%
  filter(age != '5-14')
```

```{r}
#log_suicide_ratio
formula <- as.formula(paste("scaled_log_suicide_ratio", "~", paste(scaled_log_suicide_ratio_var, collapse = " + ")))

model_log_suicide_ratio <- lm(formula, data = data2)
summary(model_log_suicide_ratio)
```
```{r}
#ssqrt_uicide_ratio
formula <- as.formula(paste("scaled_sqrt_suicide_ratio", "~", paste(scaled_sqrt_suicide_ratio_var, collapse = " + ")))

model_sqrt_suicide_ratio <- lm(formula, data = data2)
summary(model_sqrt_suicide_ratio)
```

Before we delve deeper into our linear regression analysis, it's crucial to emphasize that the assumptions underpinning this model don't need to be flawlessly met. However, severe violations can skew the model's accuracy and lead to misleading results.

Now, let's turn our attention to evaluating other necessary conditions for our regression model.

```{r}
plot(model_suicide_ratio,1)
```
```{r}
plot(model_log_suicide_ratio,1)
```
```{r}
plot(model_sqrt_suicide_ratio,1)
```

The linearity condition doesn't appear to be perfectly satisfied for any of our targets. However, the residuals for the log-transformed target are reasonably well-distributed and do not demonstrate any discernible patterns. On the other hand, the suicide_ratio and sqrt_suicide_ratio targets, particularly the former, do display unusual patterns in the residuals graph. Let's continue our evaluation by assessing the next assumption of our model.



```{r}
plot(model_suicide_ratio,2)
```
```{r}
plot(model_log_suicide_ratio,2)
```
```{r}
plot(model_sqrt_suicide_ratio,2)
```
```{r}
library(olsrr)
ols_plot_resid_hist(model_suicide_ratio)
ols_plot_resid_hist(model_log_suicide_ratio)
ols_plot_resid_hist(model_sqrt_suicide_ratio)
```
As observed, the majority of our residuals for the log_suicide_ratio and sqrt_suicide_ratio models lie within -1 and 1, and -1.5 and 1.5 respectively, and their distributions largely follow a normal pattern. However, the suicide_ratio model doesn't appear to satisfy these conditions as effectively.

To evaluate the final assumption - homoscedasticity of residuals - we apply the Breusch-Pagan test. The test's null hypothesis assumes homoscedasticity. If the p-value is significant (generally, less than 0.05), it suggests a deviation from this assumption, indicating heteroscedasticity.
 
```{r}
library(lmtest)
bptest(model_suicide_ratio)
bptest(model_log_suicide_ratio)
bptest(model_sqrt_suicide_ratio)
```

```{r}
plot(model_log_suicide_ratio,3)
plot(model_sqrt_suicide_ratio,3)
```
The plot visualizes the residuals' variance in relation to the predictors. Ideally, the residuals should be randomly scattered around the centerline, signifying homoscedasticity.

In our case, residuals are somewhat evenly distributed, indicating violation of homoscedasticity. This suggests our model probabiy be  less accurate across the predictor range, but it does not drastically impact the overall model's reliability.
we can conclude that suicide_rate not a good candidate for target value.

### 4.3.1 Feature selection 
Feature selection, also known as variable selection, attribute selection, or variable subset selection, is the process of selecting a subset of relevant features for use in model construction. The goal of feature selection is three-fold:

*Improving Model Performance: When irrelevant or partially relevant features are used to construct a predictive model, the accuracy of the model can be significantly degraded. By selecting only the most relevant features to use in model construction, we can enhance the predictive accuracy of the model.

*Reducing Overfitting: Too many features in the model can lead to overfitting, where the model performs well on the training data but poorly on unseen data. By reducing the number of features, we can make the model more generalizable.

*Enhancing Interpretability: Models with fewer features are simpler and easier to interpret.

Reducing Training Time: Fewer features mean faster training times.

Feature selection methods are intended to reduce the number of input variables to those that are believed to be most useful to a model in order to predict the target variable. Not all features are created equal. Some are relevant to the target variable, some are irrelevant, and some are redundant. Feature selection enables us to focus on the relevant and non-redundant features, increasing our model's performance and interpretability. 

Feature selection methods:

Forward Selection: You start with an empty model and add predictors one by one. In each step, you add the variable that gives the most significant improvement to the model.

Backward Selection: You start with the full model and remove predictors one by one. In each step, you remove the variable that is the least significant.

Mixed Selection: This is a combination of forward and backward selection. You start with an empty model, add variables as in forward selection, but after adding each new variable, the method may also remove variables that do not contribute to the model fit.
in this project we use criteria like RSS, adjr2, Mallow’s Cp (cp) and Bayesian Information Criterion (BIC).
Residual Sum of Squares (RSS): This is a measure of the discrepancy between the data and an estimation model. A small RSS indicates a tight fit of the model to the data.

Adjusted R-squared (adjr2): It is a modification of R-squared that adjusts for the number of predictors in the model. Unlike R-squared, the adjusted R-squared increases only if the new term enhances the model more than would be expected by chance.

Mallow’s Cp (cp): This criterion attempts to identify a model with a balance between under-fitting and over-fitting. Its ideal value is p (the number of predictors in the model), and a good model is a model where Cp is nearly equal to its p-value.

Bayesian Information Criterion (BIC): This criterion deals with model selection problems. Lower BIC means better model.


log_suicide_ratio
```{r}
library(leaps)

# First, fit a full model
full_model <- model_log_suicide_ratio

#log_suicide_ratio
formula <- as.formula(paste("scaled_log_suicide_ratio", "~", paste(scaled_log_suicide_ratio_var, collapse = " + ")))



# Forward Selection
forward_model_log <- regsubsets(formula, data = data2, nvmax = length(data2)-1, method = "forward")
forward_summary <- summary(forward_model_log)

# Backward Selection
backward_model_log <- regsubsets(formula, data = data2, nvmax = length(data2)-1, method = "backward")
backward_summary <- summary(backward_model_log)

# Mixed (stepwise) selection
stepwise_model_log <- regsubsets(formula, data = data2, nvmax = length(data2)-1, method = "seqrep")
stepwise_summary <- summary(stepwise_model_log)

# Create a dataframe with the criteria for each method
comparison_df_log <- data.frame(
  Method = c("Forward", "Backward", "Mixed"),
  RSS = c(forward_summary$rss[which.min(forward_summary$cp)], 
          backward_summary$rss[which.min(backward_summary$cp)], 
          stepwise_summary$rss[which.min(stepwise_summary$cp)]),
  AdjustedR2 = c(max(forward_summary$adjr2), max(backward_summary$adjr2), max(stepwise_summary$adjr2)),
  Cp = c(min(forward_summary$cp), min(backward_summary$cp), min(stepwise_summary$cp)),
  BIC = c(min(forward_summary$bic), min(backward_summary$bic), min(stepwise_summary$bic))
)


print(comparison_df_log)

```
forward has higher adjustedR2 and lower BIC for log.

```{r}
library(leaps)

# First, fit a full model
full_model <- model_sqrt_suicide_ratio

#log_suicide_ratio
formula <- as.formula(paste("scaled_sqrt_suicide_ratio", "~", paste(scaled_sqrt_suicide_ratio_var, collapse = " + ")))

  

# Forward Selection
forward_model_sqrt <- regsubsets(formula, data = data2, nvmax = length(data2)-1, method = "forward")
forward_summary <- summary(forward_model_sqrt)

# Backward Selection
backward_model_sqrt <- regsubsets(formula, data = data2, nvmax = length(data2)-1, method = "backward")

backward_summary <- summary(backward_model_sqrt)

# Mixed (stepwise) selection
stepwise_model_sqrt <- regsubsets(formula, data = data2, nvmax = length(data2)-1, method = "seqrep")
stepwise_summary <- summary(stepwise_model_sqrt)

# Create a dataframe with the criteria for each method
comparison_df_sqrt <- data.frame(
  Method = c("Forward", "Backward", "Mixed"),
  RSS = c(forward_summary$rss[which.min(forward_summary$cp)], 
          backward_summary$rss[which.min(backward_summary$cp)], 
          stepwise_summary$rss[which.min(stepwise_summary$cp)]),
  AdjustedR2 = c(max(forward_summary$adjr2), max(backward_summary$adjr2), max(stepwise_summary$adjr2)),
  Cp = c(min(forward_summary$cp), min(backward_summary$cp), min(stepwise_summary$cp)),
  BIC = c(min(forward_summary$bic), min(backward_summary$bic), min(stepwise_summary$bic))
)


print(comparison_df_sqrt)

```


forward has higher adjustedR2 and lower BIC for both log and sqrt but its much lesser than original model

```{r}
var1<-c(scaled_log_suicide_ratio_var,"scaled_log_suicide_ratio")
data3<-data2 %>% select(one_of(var1))
trainData_log <- data3 %>% filter(year <=2010)
testData_log <- data3 %>% filter(year >2010)
```

```{r}
var1<-c(scaled_sqrt_suicide_ratio_var,"scaled_sqrt_suicide_ratio")
data3<-data2 %>% select(one_of(var1))
trainData_sqrt <- data3 %>% filter(year <=2010)
testData_sqrt <- data3 %>% filter(year >2010)
```
just for seeing how much our works on data is been influential on performance of model we inspect initial data performance on models too
```{r}
initial_var<-c("country","year","sex","age","suicide_ratio","GDP_for_year","GDP_per_capita","generation","continent","life_exp","avg_temp","max_temp","min_temp")
data3<-data %>% select(one_of(initial_var))
trainData_initial <- data3 %>% filter(year <=2010)
testData_initial <- data3 %>% filter(year >2010)
```
### 4.4.1 Simple linear model 
first we train a model for target log 
```{r}
library(Metrics)
formula <- as.formula(paste("scaled_log_suicide_ratio", "~", paste(scaled_log_suicide_ratio_var, collapse = " + ")))

model <- lm(formula, data = trainData_log)
data_test <- testData_log
# Make predictions on the testing data
predictions <- predict(model, newdata = testData_log)
target <-"scaled_log_suicide_ratio"
# Calculate Mean Squared Error (MSE)
mse <- mse(testData_log$scaled_log_suicide_ratio, predictions)
print(paste0("MSE: ", mse))

# Calculate R-squared
sse = sum((predictions - testData_log$scaled_log_suicide_ratio)^2)
sst = sum((testData_log$scaled_log_suicide_ratio - mean(testData_log$scaled_log_suicide_ratio))^2)
r_squared = 1 - sse / sst
print(paste0("R-squared: ", r_squared))

# Calculate Adjusted R-squared
n = length(testData_log$scaled_log_suicide_ratio) # number of observations
p = length(coef(model)) - 1 # number of predictors
adjusted_r_squared = 1 - (1 - r_squared) * ((n - 1) / (n - p - 1))
print(paste0("Adjusted R-squared: ", adjusted_r_squared))

```


```{r}
library(Metrics)
formula <- as.formula(paste("scaled_sqrt_suicide_ratio", "~", paste(scaled_sqrt_suicide_ratio_var, collapse = " + ")))

model <- lm(formula, data = trainData_sqrt)

# Make predictions on the testing data
predictions <- predict(model, newdata = testData_sqrt)

# Calculate Mean Squared Error (MSE)
mse <- mse(testData_sqrt$scaled_sqrt_suicide_ratio, predictions)
print(paste0("MSE: ", mse))

# Calculate R-squared
sse = sum((predictions - testData_sqrt$scaled_sqrt_suicide_ratio)^2)
sst = sum((testData_sqrt$scaled_sqrt_suicide_ratio - mean(testData_sqrt$scaled_sqrt_suicide_ratio))^2)
r_squared = 1 - sse / sst
print(paste0("R-squared: ", r_squared))

# Calculate Adjusted R-squared
n = length(testData_sqrt$scaled_sqrt_suicide_ratio) # number of observations
p = length(coef(model)) - 1 # number of predictors
adjusted_r_squared = 1 - (1 - r_squared) * ((n - 1) / (n - p - 1))
print(paste0("Adjusted R-squared: ", adjusted_r_squared))

```

initial_data
```{r}
library(Metrics)
a <-c("country","year","sex","age","GDP_for_year","GDP_per_capita","generation","continent","life_exp","avg_temp","max_temp","min_temp")
formula <- as.formula(paste("suicide_ratio", "~", paste(a, collapse = " + ")))

model <- lm(formula, data = trainData_initial)

# Make predictions on the testing data
predictions <- predict(model, newdata = testData_initial)

# Calculate Mean Squared Error (MSE)
mse <- mse(testData_initial$suicide_ratio, predictions)
print(paste0("MSE: ", mse))

# Calculate R-squared
sse = sum((predictions - testData_initial$suicide_ratio)^2)
sst = sum((testData_initial$suicide_ratio - mean(testData_initial$suicide_ratio))^2)
r_squared = 1 - sse / sst
print(paste0("R-squared: ", r_squared))

# Calculate Adjusted R-squared
n = length(testData_initial$suicide_ratio) # number of observations
p = length(coef(model)) - 1 # number of predictors
adjusted_r_squared = 1 - (1 - r_squared) * ((n - 1) / (n - p - 1))
print(paste0("Adjusted R-squared: ", adjusted_r_squared))

```

it shows that our hard work was effective
### 4.4.2 Lasso regression model
Lasso regression is a type of regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces. The term Lasso is an acronym for Least Absolute Shrinkage and Selection Operator.

The Lasso method introduces a penalty term to the loss function of the linear regression model that is the absolute value of the magnitude of the coefficient values, or simply the absolute value of each coefficient.
lasso for log target
```{r}

library(caret)
library(glmnet)


x_train <- model.matrix(scaled_log_suicide_ratio~., trainData_log)[,-1] # Exclude intercept column
y_train <- trainData_log$scaled_log_suicide_ratio
x_test <- model.matrix(scaled_log_suicide_ratio~., testData_log)[,-1] # Exclude intercept column
y_test <- testData_log$scaled_log_suicide_ratio

# Define the cross-validation experiment
cvfit <- cv.glmnet(x_train, y_train, alpha = 1, type.measure = "mse")

# Get the optimal lambda value
lambda_optimal <- cvfit$lambda.min

# Train the final model using the optimal lambda
final_model <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_optimal)


predictions <- predict(final_model, s = lambda_optimal, newx = x_test)

# Evaluate the performance
mse <- mean((predictions - y_test)^2)

print(paste("MSE on the test set: ", mse))

rsq <- 1 - sum((predictions - y_test)^2) / sum((mean(y_test) - y_test)^2)
print(paste("RSQ test :",rsq))
# Calculate adjusted R-squared
n <- length(y_test) # number of observations
p <- coef(final_model, s = "lambda.min") # number of predictors
adj_rsq <- 1 - (1 - rsq) * (n - 1) / (n - length(p) - 1)

print(paste("Adjuster R squre :",adj_rsq))
```

for sqre target
```{r}

library(caret)
library(glmnet)


x_train <- model.matrix(scaled_sqrt_suicide_ratio~., trainData_sqrt)[,-1] # Exclude intercept column
y_train <- trainData_sqrt$scaled_sqrt_suicide_ratio
x_test <- model.matrix(scaled_sqrt_suicide_ratio~., testData_sqrt)[,-1] # Exclude intercept column
y_test <- testData_sqrt$scaled_sqrt_suicide_ratio

# Define the cross-validation experiment
cvfit <- cv.glmnet(x_train, y_train, alpha = 1, type.measure = "mse")

# Get the optimal lambda value
lambda_optimal <- cvfit$lambda.min

# Train the final model using the optimal lambda
final_model <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_optimal)


predictions <- predict(final_model, s = lambda_optimal, newx = x_test)

# Evaluate the performance
mse <- mean((predictions - y_test)^2)

print(paste("MSE on the test set: ", mse))

rsq <- 1 - sum((predictions - y_test)^2) / sum((mean(y_test) - y_test)^2)
print(paste("RSQ test :",rsq))
# Calculate adjusted R-squared
n <- length(y_test) # number of observations
p <- coef(final_model, s = "lambda.min") # number of predictors
adj_rsq <- 1 - (1 - rsq) * (n - 1) / (n - length(p) - 1)

print(paste("Adjuster R squre :",adj_rsq))
```


### 4.4.3 Ridge regression
Ridge regression, also known as Tikhonov regularization, is a regularization technique designed to deal with multicollinearity, improve prediction accuracy, and interpretability of the statistical model it is applied to. Ridge regression performs "L2 regularization," which means that it adds a penalty equivalent to the square of the magnitude of the coefficients. This results in smaller coefficients, which makes the model less complex and better at generalizing from the training data to unseen data.

for log target
```{r}

library(caret)
library(glmnet)


x_train <- model.matrix(scaled_log_suicide_ratio~., trainData_log)[,-1] # Exclude intercept column
y_train <- trainData_log$scaled_log_suicide_ratio
x_test <- model.matrix(scaled_log_suicide_ratio~., testData_log)[,-1] # Exclude intercept column
y_test <- testData_log$scaled_log_suicide_ratio

# Define the cross-validation experiment
cvfit <- cv.glmnet(x_train, y_train, alpha = 0, type.measure = "mse")

# Get the optimal lambda value
lambda_optimal <- cvfit$lambda.min

# Train the final model using the optimal lambda
final_model <- glmnet(x_train, y_train, alpha = 0, lambda = lambda_optimal)


predictions <- predict(final_model, s = lambda_optimal, newx = x_test)

# Evaluate the performance
rmse <-mean((predictions - y_test)^2)

print(paste("MSE on the test set: ", rmse))

rsq <- 1 - sum((predictions - y_test)^2) / sum((mean(y_test) - y_test)^2)
print(paste("RSQ test :",rsq))
# Calculate adjusted R-squared
n <- length(y_test) # number of observations
p <- coef(final_model, s = "lambda.min") # number of predictors
adj_rsq <- 1 - (1 - rsq) * (n - 1) / (n - length(p) - 1)

print(paste("Adjuster R squre :",adj_rsq))
```
for sqrt target 
```{r}
x_train <- model.matrix(scaled_sqrt_suicide_ratio~., trainData_sqrt)[,-1] # Exclude intercept column
y_train <- trainData_sqrt$scaled_sqrt_suicide_ratio
x_test <- model.matrix(scaled_sqrt_suicide_ratio~., testData_sqrt)[,-1] # Exclude intercept column
y_test <- testData_sqrt$scaled_sqrt_suicide_ratio

# Define the cross-validation experiment
cvfit <- cv.glmnet(x_train, y_train, alpha = 0, type.measure = "mse")

# Get the optimal lambda value
lambda_optimal <- cvfit$lambda.min

# Train the final model using the optimal lambda
final_model <- glmnet(x_train, y_train, alpha = 0, lambda = lambda_optimal)


predictions <- predict(final_model, s = lambda_optimal, newx = x_test)

# Evaluate the performance
rmse <- mean((predictions - y_test)^2)

print(paste("MSE on the test set: ", rmse))

rsq <- 1 - sum((predictions - y_test)^2) / sum((mean(y_test) - y_test)^2)
print(paste("RSQ test :",rsq))
# Calculate adjusted R-squared
n <- length(y_test) # number of observations
p <- coef(final_model, s = "lambda.min") # number of predictors
adj_rsq <- 1 - (1 - rsq) * (n - 1) / (n - length(p) - 1)

print(paste("Adjuster R squre :",adj_rsq))
```
```{r}
vec1 <-c("Simple Linear",0.2596,0.7138,0.7052,"log_suicide_ratio")
vec2<-c("Simple Linear",0.278,0.6578,0.6475,"sqrt_suicide_ratio")
vec3<-c("Simple Linear",143.18,0.485,0.4703,"suicide_ratio")
vec4<-c("Lasso",0.2604,0.7129,0.7042,"log_suicide_ratio")
vec5<-c("Lasso",0.2798,0.655,0.6453,"sqrt_suicide_ratio")
vec6<-c("Ridge",0.2615,"0.7116","0.7028","log_suicide_ratio")
vec7<- c("Ridge",0.2758,0.6607,0.6503,"sqrt_suicide_ratio")
```

# 5.Conclusions

```{r}
df <- data.frame(Model=rep(NA,7), 
                 Mean_Squared_Error=rep(NA,7), 
                 R_squared=rep(NA,7), 
                 Adjusted_R_squared=rep(NA,7), 
                 Target=rep(NA,7))
```
```{r}
df[1, ] <- vec1
df[2, ] <- vec2
df[3, ] <- vec3
df[4, ] <- vec4
df[5, ] <- vec5
df[6, ] <- vec6
df[7, ] <- vec7
```

```{r}
df
```
The simple linear regression model with log as the target slightly outperforms the others. However, simple linear, Lasso, and Ridge regressions with log as the target demonstrated quite similar performance. To distinguish more effectively between these models, we should employ cross-validation techniques.

Now, let's explore the importance of each feature in the simple linear regression model, where the target is 'log_suicide_ratio'. This will give us more insight into the significant predictors in our model.
```{r}
library(Metrics)
formula <- as.formula(paste("scaled_log_suicide_ratio", "~", paste(scaled_log_suicide_ratio_var, collapse = " + ")))

model <- lm(formula, data = trainData_log)
data_test <- testData_log
# Make predictions on the testing data
predictions <- predict(model, newdata = testData_log)
target <-"scaled_log_suicide_ratio"
# Calculate Mean Squared Error (MSE)
mse <- mse(testData_log$scaled_log_suicide_ratio, predictions)
print(paste0("MSE: ", mse))

# Calculate R-squared
sse = sum((predictions - testData_log$scaled_log_suicide_ratio)^2)
sst = sum((testData_log$scaled_log_suicide_ratio - mean(testData_log$scaled_log_suicide_ratio))^2)
r_squared = 1 - sse / sst
print(paste0("R-squared: ", r_squared))

# Calculate Adjusted R-squared
n = length(testData_log$scaled_log_suicide_ratio) # number of observations
p = length(coef(model)) - 1 # number of predictors
adjusted_r_squared = 1 - (1 - r_squared) * ((n - 1) / (n - p - 1))
print(paste0("Adjusted R-squared: ", adjusted_r_squared))

```
```{r}
summary(model)
```
as we can see the most important feature in order are:

1. sex
2. country
3. age
4. scaled_avg_temp
5. scaled_log_GDP_capita
6. population_bine_jenks
